* [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully
2026-04-01 12:59 [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Kefu Chai
@ 2026-04-01 12:59 ` Kefu Chai
2026-04-01 12:59 ` [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
2026-04-10 14:56 ` [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Lukas Sichert
2 siblings, 0 replies; 4+ messages in thread
From: Kefu Chai @ 2026-04-01 12:59 UTC (permalink / raw)
To: pve-devel
When an RBD pool contains a corrupt or orphaned image, 'rbd ls --long
--format json' emits a per-image error to stderr and omits the broken
image from its output. PVE previously discarded all stderr from this
command via 'errfunc => sub { }', so on a non-zero exit the error
surfaced as a generic 500 without identifying the problematic image.
The exit code is unreliable: it reflects only the last image processed
(last-wins), so a per-image failure may or may not propagate depending
on the order images are visited. The per-image error on stderr is the
only reliable signal.
Capture stderr from 'rbd ls --long'. When any errors are detected
(non-zero exit or per-image stderr messages), fall back to 'rbd ls
--format json' which only lists image names without opening them and
always succeeds. Images present in the name list but absent from the
detailed listing are returned with size=-1 so the caller can identify
them as inaccessible. A per-image warning naming the broken image is
emitted to aid diagnosis.
If the name-only listing also fails, a fatal error (pool not found, auth
failure, etc.) is re-propagated unchanged.
When no errors occur, behaviour is unchanged.
While at it, use '--long' instead of '-l' for readability and
consistency with the other long-form options used in the command.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
src/PVE/Storage/RBDPlugin.pm | 87 ++++++++++++++++++++++++++++++------
1 file changed, 73 insertions(+), 14 deletions(-)
diff --git a/src/PVE/Storage/RBDPlugin.pm b/src/PVE/Storage/RBDPlugin.pm
index 7d3e7ab..92d1f63 100644
--- a/src/PVE/Storage/RBDPlugin.pm
+++ b/src/PVE/Storage/RBDPlugin.pm
@@ -222,36 +222,95 @@ sub rbd_ls {
my ($scfg, $storeid) = @_;
my $raw = '';
- my $parser = sub { $raw .= shift };
+ my @errs;
- my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '-l', '--format', 'json');
- run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub { }, outfunc => $parser);
+ my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--long', '--format', 'json');
+ eval {
+ run_rbd_command(
+ $cmd,
+ errmsg => "rbd error",
+ errfunc => sub { push(@errs, shift); },
+ outfunc => sub { $raw .= shift; },
+ );
+ };
+ my $ls_err = $@;
+ # rbd ls --long outputs a complete JSON array of successfully-opened images;
+ # images that fail to open are omitted from the output and logged to stderr,
+ # but the command still exits 0. Parse whatever we got.
my $result;
if ($raw eq '') {
$result = [];
} elsif ($raw =~ m/^(\[.*\])$/s) { # untaint
$result = JSON::decode_json($1);
- } else {
+ } elsif (!$ls_err) {
die "got unexpected data from rbd ls: '$raw'\n";
}
my $list = {};
- foreach my $el (@$result) {
- next if defined($el->{snapshot});
+ if ($result) {
+ for my $el (@$result) {
+ next if defined($el->{snapshot});
- my $image = $el->{image};
+ my $image = $el->{image};
- my ($owner) = $image =~ m/^(?:vm|base)-(\d+)-/;
- next if !defined($owner);
+ my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+ next if !defined($owner);
+
+ $list->{$image} = {
+ name => $image,
+ size => $el->{size},
+ parent => $get_parent_image_name->($el->{parent}),
+ vmid => $owner,
+ };
+ }
+ }
- $list->{$image} = {
- name => $image,
- size => $el->{size},
- parent => $get_parent_image_name->($el->{parent}),
- vmid => $owner,
+ # rbd ls --long exit code is unreliable: it reflects only the last image
+ # processed (last-wins), so stderr is the only reliable signal for
+ # per-image errors.
+ #
+ # When any errors were detected (non-zero exit or stderr), fall back to
+ # name-only listing which never opens images and always succeeds. If the
+ # name-only listing itself fails, re-propagate as a fatal error (pool not
+ # found, auth failure, etc.).
+ if ($ls_err || @errs) {
+ my $details = @errs ? ": @errs" : "";
+ warn "rbd ls --long had errors, checking for broken images$details\n";
+
+ my $names_raw = '';
+ my $names_cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--format', 'json');
+ eval {
+ run_rbd_command(
+ $names_cmd,
+ errmsg => "rbd error",
+ errfunc => sub { },
+ outfunc => sub { $names_raw .= shift; },
+ );
};
+ die $@ if $@;
+
+ my $all_names = [];
+ if ($names_raw =~ m/^(\[.*\])$/s) { # untaint
+ $all_names = eval { JSON::decode_json($1); };
+ die "invalid JSON output from 'rbd ls': $@\n" if $@;
+ }
+
+ for my $image ($all_names->@*) {
+ next if exists($list->{$image});
+
+ my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+ next if !defined($owner);
+
+ warn "rbd image '$image' is corrupt or inaccessible\n";
+ $list->{$image} = {
+ name => $image,
+ size => -1,
+ parent => undef,
+ vmid => $owner,
+ };
+ }
}
return $list;
--
2.47.3
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images
2026-04-01 12:59 [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Kefu Chai
2026-04-01 12:59 ` [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully Kefu Chai
2026-04-01 12:59 ` [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
@ 2026-04-10 14:56 ` Lukas Sichert
2 siblings, 0 replies; 4+ messages in thread
From: Lukas Sichert @ 2026-04-10 14:56 UTC (permalink / raw)
To: Kefu Chai, pve-devel
I was able to recreate the bug by creating a container in the rbd pool
and then deleting the header using 'rados -p <rbd-pool> rm
rbd_header.<id>'. After applying the patch, the GUI correctly shows
'N/A (inaccessible)'.
The changes look good to me. Using size => -1 makes sense in this context.
Small nit:
The line
'warn "rbd ls --long had errors, checking for broken images$details\n";'
currently produces rather noisy output including timestamps and full
context, e.g.:
'rbd ls --long had errors, checking for broken images:
2026-04-10T15:12:39.025+0200 77994ffff6c0 -1
librbd::image::OpenRequest: failed to retrieve initial metadata: (2)
No such file or directory
rbd: error opening vm-104-disk-0: (2) No such file or directory'.
Since the affected image is reported later anyway, it would be cleaner
to reduce this to the core librbd error message. For example:
'my $details = join('', @errs) =~
/^(?:\S+\s+\S+\s+-?\d+\s+)?(librbd::.*?): \(/m ? ": $1" : "";'
to only print
'librbd::image::OpenRequest: failed to retrieve initial metadata'
But this exact regex implementation might be a bit fragile, as I don't
know if there is always ': (' after the specific error message.
But as this is a rather small change, feel free to use my Tested-by
and Reviewed-by in the next version, if no bigger changes are made.
Reviewed-by: Lukas Sichert <l.sichert@proxmox.com>
Tested-by: Lukas Sichert <l.sichert@proxmox.com>
On 2026-04-01 14:59, Kefu Chai wrote:
> When a Ceph RBD pool contains a corrupt or orphaned image, the PVE WebGUI
> shows a generic 500 error without identifying which image is broken.
>
> The root cause is in rbd_ls(): all stderr was discarded via
> errfunc => sub { }, so per-image error messages from librbd (which name
> the broken image) were lost. The rbd ls -l exit code is also unreliable:
> it reflects only the last image processed, so a per-image failure may or
> may not propagate to the caller depending on image order. This was
> confirmed by auditing the Ceph main and latest Squid source, and verified
> by testing against a cluster built from Ceph main HEAD.
>
> The fix captures stderr and treats any error signal (non-zero exit or
> stderr output) as a cue to run a fallback name-only 'rbd ls', which never
> opens images and succeeds for valid pools. Images present in the name list
> but absent from the detailed results are returned with size=-1. If the
> fallback also fails, the error is a fatal one (pool not found, auth
> failure) and is re-propagated as before.
>
> A few alternatives were considered for how to signal inaccessible images
> to the UI:
>
> a) Omit broken images entirely. Simple, but the storage content view would
> silently appear healthy with no indication that images are missing.
>
> b) Add a new status field (e.g. status => 'inaccessible'). Explicit and
> extensible, but requires an API schema change and all callers to handle
> the new field.
>
> c) Emit a non-fatal warning alongside the partial results. This would
> require changes to the REST framework's error model, which is not how
> other storage plugins report partial failures.
>
> d) Use size => 0 as a sentinel. No API change needed, but ambiguous since
> a newly created image can legitimately have size 0.
>
> e) Use size => -1 as a sentinel (this patch). No API schema change needed;
> the field is already type => 'integer' with no minimum constraint, and
> the value flows through the stack unchanged. The UI patch renders it as
> 'N/A (inaccessible)'. The trade-off is that -1 is an implicit convention
> rather than a proper status field, which could be formalised later.
>
> Kefu Chai (2):
> fix #7000: rbd: handle corrupt or inaccessible images gracefully
> storage: content: show inaccessible RBD images in UI
>
> src/PVE/Storage/RBDPlugin.pm | 87 ++++++++++++++++++++++++++++++------
> www/manager6/storage/ContentView.js | 7 ++++++-
> 2 files changed, 79 insertions(+), 15 deletions(-)
>
^ permalink raw reply [flat|nested] 4+ messages in thread