From: Lukas Sichert
To: Kefu Chai, pve-devel@lists.proxmox.com
Subject: Re: [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images
Date: Fri, 10 Apr 2026 16:56:10 +0200
Message-ID: <45e91b2d-0ea2-4122-8d10-3544fb01b08f@proxmox.com>
In-Reply-To: <20260401125933.3643604-1-k.chai@proxmox.com>
References: <20260401125933.3643604-1-k.chai@proxmox.com>
X-MailFrom: l.sichert@proxmox.com
List-Id: Proxmox VE development discussion

I was able to reproduce the bug by creating a container in the RBD pool
and then deleting the header using 'rados -p rm rbd_header.'.
After applying the patch, the GUI correctly shows 'N/A (inaccessible)'.

The changes look good to me. Using size => -1 makes sense in this
context.

Small nit: the line

    warn "rbd ls --long had errors, checking for broken images$details\n";

currently produces rather noisy output, including timestamps and full
context, e.g.:

    rbd ls --long had errors, checking for broken images: 2026-04-10T15:12:39.025+0200 77994ffff6c0 -1 librbd::image::OpenRequest: failed to retrieve initial metadata: (2) No such file or directory
    rbd: error opening vm-104-disk-0: (2) No such file or directory

Since the affected image is reported later anyway, it would be cleaner
to reduce this to the core librbd error message. For example:

    my $details = join('', @errs) =~ /^(?:\S+\s+\S+\s+-?\d+\s+)?(librbd::.*?): \(/m ? ": $1" : "";

would print only

    librbd::image::OpenRequest: failed to retrieve initial metadata

This exact regex might be a bit fragile, though, as I don't know whether
the specific error message is always followed by ': ('.

As this is a rather small change, feel free to add my Tested-by and
Reviewed-by to the next version if no bigger changes are made.
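To make the fragility concern concrete, here is a minimal stand-alone sketch of a slightly more tolerant variant. It is not code from the patch: the helper name short_librbd_error is hypothetical, and it assumes that @errs holds the captured stderr lines, each still ending in a newline. The only change to the pattern is that the message may end either at ': (' or at the end of the line, so a librbd message without an errno suffix would still be picked up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): reduce captured stderr lines
# to the core librbd message, or return '' if no librbd line is found.
# Assumption: every element of @errs is one stderr line ending in "\n".
sub short_librbd_error {
    my (@errs) = @_;

    # Optionally skip three whitespace-separated prefix fields (timestamp,
    # an id, a signed number), then capture the librbd message up to either
    # ': (' (the errno part) or the end of the line.
    if (join('', @errs) =~ /^(?:\S+\s+\S+\s+-?\d+\s+)?(librbd::.*?)(?:: \(|$)/m) {
        return ": $1";
    }
    return '';
}

# Sample stderr taken from the output quoted above.
my @errs = (
    "2026-04-10T15:12:39.025+0200 77994ffff6c0 -1 librbd::image::OpenRequest: "
        . "failed to retrieve initial metadata: (2) No such file or directory\n",
    "rbd: error opening vm-104-disk-0: (2) No such file or directory\n",
);

warn "rbd ls --long had errors, checking for broken images"
    . short_librbd_error(@errs) . "\n";
```

With the sample above, only the librbd::image::OpenRequest message is appended to the warning. If stderr contains no librbd line at all, the helper returns an empty string and the warning stays generic.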
Reviewed-by: Lukas Sichert
Tested-by: Lukas Sichert

On 2026-04-01 14:59, Kefu Chai wrote:
> When a Ceph RBD pool contains a corrupt or orphaned image, the PVE WebGUI
> shows a generic 500 error without identifying which image is broken.
>
> The root cause is in rbd_ls(): all stderr was discarded via
> errfunc => sub { }, so per-image error messages from librbd (which name
> the broken image) were lost. The rbd ls -l exit code is also unreliable:
> it reflects only the last image processed, so a per-image failure may or
> may not propagate to the caller depending on image order. This was
> confirmed by auditing the Ceph main and latest Squid source, and verified
> by testing against a cluster built from Ceph main HEAD.
>
> The fix captures stderr and treats any error signal (non-zero exit or
> stderr output) as a cue to run a fallback name-only 'rbd ls', which never
> opens images and succeeds for valid pools. Images present in the name list
> but absent from the detailed results are returned with size=-1. If the
> fallback also fails, the error is a fatal one (pool not found, auth
> failure) and is re-propagated as before.
>
> A few alternatives were considered for how to signal inaccessible images
> to the UI:
>
> a) Omit broken images entirely. Simple, but the storage content view would
>    silently appear healthy with no indication that images are missing.
>
> b) Add a new status field (e.g. status => 'inaccessible'). Explicit and
>    extensible, but requires an API schema change and all callers to handle
>    the new field.
>
> c) Emit a non-fatal warning alongside the partial results. This would
>    require changes to the REST framework's error model, which is not how
>    other storage plugins report partial failures.
>
> d) Use size => 0 as a sentinel. No API change needed, but ambiguous since
>    a newly created image can legitimately have size 0.
>
> e) Use size => -1 as a sentinel (this patch). No API schema change needed;
>    the field is already type => 'integer' with no minimum constraint, and
>    the value flows through the stack unchanged. The UI patch renders it as
>    'N/A (inaccessible)'. The trade-off is that -1 is an implicit convention
>    rather than a proper status field, which could be formalised later.
>
> Kefu Chai (2):
>   fix #7000: rbd: handle corrupt or inaccessible images gracefully
>   storage: content: show inaccessible RBD images in UI
>
>  src/PVE/Storage/RBDPlugin.pm        | 87 ++++++++++++++++++++++++++++++------
>  www/manager6/storage/ContentView.js |  7 ++++++-
>  2 files changed, 79 insertions(+), 15 deletions(-)
>
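
For readers following the thread, the merge step described in the cover letter (images present in the name list but absent from the detailed results get size => -1) can be sketched as below. This is only an illustration of the idea, not the code from the patch: the helper name mark_inaccessible, the shape of the two inputs, and the image names and sizes are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not the patch itself.
#   $detailed: hashref of image name => info hash, as might be parsed from
#              the detailed 'rbd ls --long' listing (assumed shape).
#   $names:    arrayref of image names from the name-only 'rbd ls'.
sub mark_inaccessible {
    my ($detailed, $names) = @_;

    my %result = %$detailed;
    for my $name (@$names) {
        # Listed by name but missing from the detailed listing: the image
        # could not be opened, so report it with the -1 sentinel size.
        $result{$name} //= { name => $name, size => -1 };
    }
    return \%result;
}

# Made-up example data: one healthy image and one that failed to open.
my $detailed = {
    'vm-100-disk-0' => { name => 'vm-100-disk-0', size => 4294967296 },
};
my $names = ['vm-100-disk-0', 'vm-104-disk-0'];

my $merged = mark_inaccessible($detailed, $names);
printf "%s: %s\n", $_, $merged->{$_}{size} for sort keys %$merged;
```

Entries that already have detailed information are left untouched, so a consumer only needs to check for a negative size to tell the two cases apart.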