all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images
@ 2026-04-01 12:59 Kefu Chai
  2026-04-01 12:59 ` [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully Kefu Chai
  2026-04-01 12:59 ` [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
  0 siblings, 2 replies; 3+ messages in thread
From: Kefu Chai @ 2026-04-01 12:59 UTC (permalink / raw)
  To: pve-devel

When a Ceph RBD pool contains a corrupt or orphaned image, the PVE WebGUI
shows a generic 500 error without identifying which image is broken.

The root cause is in rbd_ls(): all stderr was discarded via
errfunc => sub { }, so per-image error messages from librbd (which name
the broken image) were lost. The rbd ls -l exit code is also unreliable:
it reflects only the last image processed, so a per-image failure may or
may not propagate to the caller depending on image order. This was
confirmed by auditing the Ceph main and latest Squid source, and verified
by testing against a cluster built from Ceph main HEAD.

The fix captures stderr and treats any error signal (non-zero exit or
stderr output) as a cue to run a fallback name-only 'rbd ls', which never
opens images and succeeds for valid pools. Images present in the name list
but absent from the detailed results are returned with size=-1. If the
fallback also fails, the error is a fatal one (pool not found, auth
failure) and is re-propagated as before.

A few alternatives were considered for how to signal inaccessible images
to the UI:

a) Omit broken images entirely. Simple, but the storage content view would
   silently appear healthy with no indication that images are missing.

b) Add a new status field (e.g. status => 'inaccessible'). Explicit and
   extensible, but requires an API schema change and all callers to handle
   the new field.

c) Emit a non-fatal warning alongside the partial results. This would
   require changes to the REST framework's error model, which is not how
   other storage plugins report partial failures.

d) Use size => 0 as a sentinel. No API change needed, but ambiguous since
   a newly created image can legitimately have size 0.

e) Use size => -1 as a sentinel (this patch). No API schema change needed;
   the field is already type => 'integer' with no minimum constraint, and
   the value flows through the stack unchanged. The UI patch renders it as
   'N/A (inaccessible)'. The trade-off is that -1 is an implicit convention
   rather than a proper status field, which could be formalised later.

Kefu Chai (2):
  fix #7000: rbd: handle corrupt or inaccessible images gracefully
  storage: content: show inaccessible RBD images in UI

 src/PVE/Storage/RBDPlugin.pm        | 87 ++++++++++++++++++++++++++++++------
 www/manager6/storage/ContentView.js |  7 ++++++-
 2 files changed, 79 insertions(+), 15 deletions(-)

-- 
2.47.3





^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully
  2026-04-01 12:59 [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Kefu Chai
@ 2026-04-01 12:59 ` Kefu Chai
  2026-04-01 12:59 ` [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-04-01 12:59 UTC (permalink / raw)
  To: pve-devel

When an RBD pool contains a corrupt or orphaned image, 'rbd ls --long
--format json' emits a per-image error to stderr and omits the broken
image from its output. PVE previously discarded all stderr from this
command via 'errfunc => sub { }', so on a non-zero exit the error
surfaced as a generic 500 without identifying the problematic image.

The exit code is unreliable: it reflects only the last image processed
(last-wins), so a per-image failure may or may not propagate depending
on the order images are visited. The per-image error on stderr is the
only reliable signal.

Capture stderr from 'rbd ls --long'. When any errors are detected
(non-zero exit or per-image stderr messages), fall back to 'rbd ls
--format json' which only lists image names without opening them and
always succeeds. Images present in the name list but absent from the
detailed listing are returned with size=-1 so the caller can identify
them as inaccessible. A per-image warning naming the broken image is
emitted to aid diagnosis.

If the name-only listing also fails, a fatal error (pool not found, auth
failure, etc.) is re-propagated unchanged.

When no errors occur, behaviour is unchanged.

While at it, use '--long' instead of '-l' for readability and
consistency with the other long-form options used in the command.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 src/PVE/Storage/RBDPlugin.pm | 87 ++++++++++++++++++++++++++++++------
 1 file changed, 73 insertions(+), 14 deletions(-)

diff --git a/src/PVE/Storage/RBDPlugin.pm b/src/PVE/Storage/RBDPlugin.pm
index 7d3e7ab..92d1f63 100644
--- a/src/PVE/Storage/RBDPlugin.pm
+++ b/src/PVE/Storage/RBDPlugin.pm
@@ -222,36 +222,95 @@ sub rbd_ls {
     my ($scfg, $storeid) = @_;
 
     my $raw = '';
-    my $parser = sub { $raw .= shift };
+    my @errs;
 
-    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '-l', '--format', 'json');
-    run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub { }, outfunc => $parser);
+    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--long', '--format', 'json');
+    eval {
+        run_rbd_command(
+            $cmd,
+            errmsg => "rbd error",
+            errfunc => sub { push(@errs, shift); },
+            outfunc => sub { $raw .= shift; },
+        );
+    };
+    my $ls_err = $@;
 
+    # rbd ls --long outputs a complete JSON array of successfully-opened images;
+    # images that fail to open are omitted from the output and logged to stderr,
+    # but the command still exits 0. Parse whatever we got.
     my $result;
     if ($raw eq '') {
         $result = [];
     } elsif ($raw =~ m/^(\[.*\])$/s) { # untaint
         $result = JSON::decode_json($1);
-    } else {
+    } elsif (!$ls_err) {
         die "got unexpected data from rbd ls: '$raw'\n";
     }
 
     my $list = {};
 
-    foreach my $el (@$result) {
-        next if defined($el->{snapshot});
+    if ($result) {
+        for my $el (@$result) {
+            next if defined($el->{snapshot});
 
-        my $image = $el->{image};
+            my $image = $el->{image};
 
-        my ($owner) = $image =~ m/^(?:vm|base)-(\d+)-/;
-        next if !defined($owner);
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
+
+            $list->{$image} = {
+                name => $image,
+                size => $el->{size},
+                parent => $get_parent_image_name->($el->{parent}),
+                vmid => $owner,
+            };
+        }
+    }
 
-        $list->{$image} = {
-            name => $image,
-            size => $el->{size},
-            parent => $get_parent_image_name->($el->{parent}),
-            vmid => $owner,
+    # rbd ls --long exit code is unreliable: it reflects only the last image
+    # processed (last-wins), so stderr is the only reliable signal for
+    # per-image errors.
+    #
+    # When any errors were detected (non-zero exit or stderr), fall back to
+    # name-only listing which never opens images and always succeeds. If the
+    # name-only listing itself fails, re-propagate as a fatal error (pool not
+    # found, auth failure, etc.).
+    if ($ls_err || @errs) {
+        my $details = @errs ? ": @errs" : "";
+        warn "rbd ls --long had errors, checking for broken images$details\n";
+
+        my $names_raw = '';
+        my $names_cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--format', 'json');
+        eval {
+            run_rbd_command(
+                $names_cmd,
+                errmsg => "rbd error",
+                errfunc => sub { },
+                outfunc => sub { $names_raw .= shift; },
+            );
         };
+        die $@ if $@;
+
+        my $all_names = [];
+        if ($names_raw =~ m/^(\[.*\])$/s) { # untaint
+            $all_names = eval { JSON::decode_json($1); };
+            die "invalid JSON output from 'rbd ls': $@\n" if $@;
+        }
+
+        for my $image ($all_names->@*) {
+            next if exists($list->{$image});
+
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
+
+            warn "rbd image '$image' is corrupt or inaccessible\n";
+            $list->{$image} = {
+                name => $image,
+                size => -1,
+                parent => undef,
+                vmid => $owner,
+            };
+        }
     }
 
     return $list;
-- 
2.47.3





^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI
  2026-04-01 12:59 [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Kefu Chai
  2026-04-01 12:59 ` [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully Kefu Chai
@ 2026-04-01 12:59 ` Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-04-01 12:59 UTC (permalink / raw)
  To: pve-devel

When a corrupt or inaccessible RBD image is returned with size=-1
(introduced in the pve-storage rbd_ls() fix for bug #7000), render
it as 'N/A (inaccessible)' instead of the nonsensical '-1 B'.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 www/manager6/storage/ContentView.js | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/www/manager6/storage/ContentView.js b/www/manager6/storage/ContentView.js
index 75b81e69..aa506887 100644
--- a/www/manager6/storage/ContentView.js
+++ b/www/manager6/storage/ContentView.js
@@ -187,7 +187,12 @@ Ext.define(
                 size: {
                     header: gettext('Size'),
                     width: 100,
-                    renderer: Proxmox.Utils.format_size,
+                    renderer: function (v) {
+                        if (v === -1) {
+                            return gettext('N/A (inaccessible)');
+                        }
+                        return Proxmox.Utils.format_size(v);
+                    },
                     dataIndex: 'size',
                 },
             };
-- 
2.47.3





^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-04-01 12:59 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-01 12:59 [PATCH manager/storage 0/2] fix #7000: rbd: graceful handling of corrupt/inaccessible images Kefu Chai
2026-04-01 12:59 ` [PATCH storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully Kefu Chai
2026-04-01 12:59 ` [PATCH manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal