From: Kefu Chai
To: pve-devel@lists.proxmox.com
Subject: [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully
Date: Wed, 1 Apr 2026 20:59:32 +0800
Message-ID: <20260401125933.3643604-2-k.chai@proxmox.com>
In-Reply-To: <20260401125933.3643604-1-k.chai@proxmox.com>
References: <20260401125933.3643604-1-k.chai@proxmox.com>
X-Mailer: git-send-email 2.47.3
List-Id: Proxmox VE development discussion

When an RBD pool contains a corrupt or orphaned image,
'rbd ls --long --format json' emits a per-image error to stderr and
omits the broken image from its
output. PVE previously discarded all stderr from this command via
'errfunc => sub { }', so on a non-zero exit the error surfaced as a
generic 500 without identifying the problematic image.

The exit code is unreliable: it reflects only the last image processed
(last-wins), so a per-image failure may or may not propagate depending
on the order in which images are visited. The per-image error on stderr
is the only reliable signal.

Capture stderr from 'rbd ls --long'. When any errors are detected
(non-zero exit or per-image stderr messages), fall back to
'rbd ls --format json', which only lists image names without opening
them and is therefore unaffected by per-image failures. Images present
in the name list but absent from the detailed listing are returned with
size=-1 so the caller can identify them as inaccessible. A per-image
warning naming the broken image is emitted to aid diagnosis.

If the name-only listing also fails, the failure is treated as fatal
(pool not found, auth failure, etc.) and the error is re-propagated
unchanged. When no errors occur, behaviour is unchanged.

While at it, use '--long' instead of '-l' for readability and
consistency with the other long-form options used in the command.
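
The merge step described above can be illustrated with a minimal
standalone sketch. It is not the plugin code: 'merge_rbd_listings' is a
hypothetical helper, the two JSON strings are made-up sample data, and
the JSON keys ('image', 'size', 'snapshot') are assumed to match what
rbd_ls reads from the 'rbd ls --long --format json' output.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper (not part of RBDPlugin.pm): merge the detailed listing
# with the name-only listing. Both arguments are JSON text.
sub merge_rbd_listings {
    my ($long_json, $names_json) = @_;

    my $list = {};

    # Images that could be opened keep their reported size.
    for my $el (@{ decode_json($long_json) }) {
        next if defined($el->{snapshot});
        my $image = $el->{image};
        my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
        next if !defined($owner);
        $list->{$image} = { name => $image, size => $el->{size}, vmid => $owner };
    }

    # Names absent from the detailed listing could not be opened: size = -1.
    for my $image (@{ decode_json($names_json) }) {
        next if exists($list->{$image});
        my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
        next if !defined($owner);
        $list->{$image} = { name => $image, size => -1, vmid => $owner };
    }

    return $list;
}

# Made-up sample data: 'vm-101-disk-0' is missing from the detailed listing.
my $long_json = '[{"image":"vm-100-disk-0","size":1073741824}]';
my $names_json = '["vm-100-disk-0","vm-101-disk-0","unrelated-image"]';

my $list = merge_rbd_listings($long_json, $names_json);

# A caller can pick out inaccessible images by their size marker.
my @broken = sort grep { $list->{$_}{size} == -1 } keys %$list;
print "inaccessible: @broken\n";
```

With this sample input, only 'vm-101-disk-0' is reported as
inaccessible; 'unrelated-image' is skipped because its name does not
match the vm-/base- naming pattern.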

Signed-off-by: Kefu Chai
---
 src/PVE/Storage/RBDPlugin.pm | 87 ++++++++++++++++++++++++++++++------
 1 file changed, 73 insertions(+), 14 deletions(-)

diff --git a/src/PVE/Storage/RBDPlugin.pm b/src/PVE/Storage/RBDPlugin.pm
index 7d3e7ab..92d1f63 100644
--- a/src/PVE/Storage/RBDPlugin.pm
+++ b/src/PVE/Storage/RBDPlugin.pm
@@ -222,36 +222,95 @@ sub rbd_ls {
     my ($scfg, $storeid) = @_;
 
     my $raw = '';
-    my $parser = sub { $raw .= shift };
+    my @errs;
 
-    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '-l', '--format', 'json');
-    run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub { }, outfunc => $parser);
+    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--long', '--format', 'json');
+    eval {
+        run_rbd_command(
+            $cmd,
+            errmsg => "rbd error",
+            errfunc => sub { push(@errs, shift); },
+            outfunc => sub { $raw .= shift; },
+        );
+    };
+    my $ls_err = $@;
 
+    # rbd ls --long outputs a complete JSON array of successfully-opened images;
+    # images that fail to open are omitted from the output and logged to stderr,
+    # and the command may still exit 0. Parse whatever we got.
     my $result;
     if ($raw eq '') {
         $result = [];
     } elsif ($raw =~ m/^(\[.*\])$/s) { # untaint
         $result = JSON::decode_json($1);
-    } else {
+    } elsif (!$ls_err) {
         die "got unexpected data from rbd ls: '$raw'\n";
     }
 
     my $list = {};
 
-    foreach my $el (@$result) {
-        next if defined($el->{snapshot});
+    if ($result) {
+        for my $el (@$result) {
+            next if defined($el->{snapshot});
 
-        my $image = $el->{image};
+            my $image = $el->{image};
 
-        my ($owner) = $image =~ m/^(?:vm|base)-(\d+)-/;
-        next if !defined($owner);
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
+
+            $list->{$image} = {
+                name => $image,
+                size => $el->{size},
+                parent => $get_parent_image_name->($el->{parent}),
+                vmid => $owner,
+            };
+        }
+    }
 
-        $list->{$image} = {
-            name => $image,
-            size => $el->{size},
-            parent => $get_parent_image_name->($el->{parent}),
-            vmid => $owner,
+    # rbd ls --long exit code is unreliable: it reflects only the last image
+    # processed (last-wins), so stderr is the only reliable signal for
+    # per-image errors.
+    #
+    # When any errors were detected (non-zero exit or stderr), fall back to
+    # a name-only listing, which never opens images. If the
+    # name-only listing itself fails, re-propagate as a fatal error (pool not
+    # found, auth failure, etc.).
+    if ($ls_err || @errs) {
+        my $details = @errs ? ": @errs" : "";
+        warn "rbd ls --long had errors, checking for broken images$details\n";
+
+        my $names_raw = '';
+        my $names_cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--format', 'json');
+        eval {
+            run_rbd_command(
+                $names_cmd,
+                errmsg => "rbd error",
+                errfunc => sub { },
+                outfunc => sub { $names_raw .= shift; },
+            );
         };
+        die $@ if $@;
+
+        my $all_names = [];
+        if ($names_raw =~ m/^(\[.*\])$/s) { # untaint
+            $all_names = eval { JSON::decode_json($1); };
+            die "invalid JSON output from 'rbd ls': $@\n" if $@;
+        }
+
+        for my $image ($all_names->@*) {
+            next if exists($list->{$image});
+
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
+
+            warn "rbd image '$image' is corrupt or inaccessible\n";
+            $list->{$image} = {
+                name => $image,
+                size => -1,
+                parent => undef,
+                vmid => $owner,
+            };
+        }
     }
 
     return $list;
-- 
2.47.3