public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [PATCH v3 manager/storage 0/2] rbd: handle corrupt or inaccessible images gracefully
@ 2026-06-29 11:48 Kefu Chai
  2026-06-29 11:48 ` [PATCH v3 storage 1/2] fix #7000: " Kefu Chai
  2026-06-29 11:48 ` [PATCH v3 manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
  0 siblings, 2 replies; 3+ messages in thread
From: Kefu Chai @ 2026-06-29 11:48 UTC (permalink / raw)
  To: pve-devel

I tested this by deleting the header object of an RBD image. The
fallback triggered, the broken image appeared in 'pvesm list' with
size=-1, and a warning named it. Other images in the pool were
unaffected.

Changes since v2:
- Use traditional dereference (@$all_names) instead of postfix
  dereference ($all_names->@*) for consistency with the rest of the file.

Kefu Chai (2):
  fix #7000: rbd: handle corrupt or inaccessible images gracefully
  storage: content: show inaccessible RBD images in UI

 src/PVE/Storage/RBDPlugin.pm              | 96 ++++++++++++++++++++++++++++++------
 www/manager6/storage/ContentView.js       |  3 +++
 2 files changed, 85 insertions(+), 14 deletions(-)

-- 
2.47.3





^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v3 storage 1/2] fix #7000: rbd: handle corrupt or inaccessible images gracefully
  2026-06-29 11:48 [PATCH v3 manager/storage 0/2] rbd: handle corrupt or inaccessible images gracefully Kefu Chai
@ 2026-06-29 11:48 ` Kefu Chai
  2026-06-29 11:48 ` [PATCH v3 manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-06-29 11:48 UTC (permalink / raw)
  To: pve-devel; +Cc: Lukas Sichert

When an RBD pool contains a corrupt or orphaned image, 'rbd ls --long
--format json' emits a per-image error to stderr and omits the broken
image from its output. PVE discarded all stderr via 'errfunc => sub { }',
so a non-zero exit surfaced as a generic 500 with no indication of
which image was broken.

The exit code is unreliable: it reflects only the last image processed,
so a per-image failure may or may not propagate depending on visit order.
Stderr is the only reliable signal.

Capture stderr from 'rbd ls --long'. When the command fails or stderr
is non-empty, fall back to 'rbd ls --format json', which lists names
without opening images. Images in the name list but absent from the
detailed output get size=-1 so callers can identify them as
inaccessible. A warning names each broken image.

If the name-only listing also fails (pool not found, auth failure,
etc.), the error re-propagates unchanged. When no errors occur,
behaviour is unchanged.

While at it, use '--long' instead of '-l' for readability.

The warn message extracts the meaningful part of each librbd error from
stderr. Ceph log lines have the format:

  TIMESTAMP ADDR LEVEL librbd::Class::Method: message: (errno) strerror

The "librbd::" prefix is injected by the dout_prefix macro defined in
each librbd source file; cpp_strerror() always formats the errno as
"(N) strerror", so ": (" reliably marks the end of the human-readable
message. Duplicate messages are deduplicated in case many images fail
the same way.

Reviewed-by: Lukas Sichert <l.sichert@proxmox.com>
Tested-by: Lukas Sichert <l.sichert@proxmox.com>
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 src/PVE/Storage/RBDPlugin.pm | 96 ++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 14 deletions(-)

diff --git a/src/PVE/Storage/RBDPlugin.pm b/src/PVE/Storage/RBDPlugin.pm
index b537425..d74d6d2 100644
--- a/src/PVE/Storage/RBDPlugin.pm
+++ b/src/PVE/Storage/RBDPlugin.pm
@@ -222,36 +222,104 @@ sub rbd_ls {
     my ($scfg, $storeid) = @_;
 
     my $raw = '';
-    my $parser = sub { $raw .= shift };
+    my @errs;
 
-    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '-l', '--format', 'json');
-    run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub { }, outfunc => $parser);
+    my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--long', '--format', 'json');
+    eval {
+        run_rbd_command(
+            $cmd,
+            errmsg => "rbd error",
+            errfunc => sub { push(@errs, shift); },
+            outfunc => sub { $raw .= shift; },
+        );
+    };
+    my $ls_err = $@;
 
+    # rbd ls --long outputs a complete JSON array of successfully-opened images;
+    # images that fail to open are omitted from the output and logged to stderr,
+    # but the command still exits 0. Parse whatever we got.
     my $result;
     if ($raw eq '') {
         $result = [];
     } elsif ($raw =~ m/^(\[.*\])$/s) { # untaint
         $result = JSON::decode_json($1);
-    } else {
+    } elsif (!$ls_err) {
         die "got unexpected data from rbd ls: '$raw'\n";
     }
 
     my $list = {};
 
-    foreach my $el (@$result) {
-        next if defined($el->{snapshot});
+    if ($result) {
+        for my $el (@$result) {
+            next if defined($el->{snapshot});
+
+            my $image = $el->{image};
 
-        my $image = $el->{image};
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
 
-        my ($owner) = $image =~ m/^(?:vm|base)-(\d+)-/;
-        next if !defined($owner);
+            $list->{$image} = {
+                name => $image,
+                size => $el->{size},
+                parent => $get_parent_image_name->($el->{parent}),
+                vmid => $owner,
+            };
+        }
+    }
 
-        $list->{$image} = {
-            name => $image,
-            size => $el->{size},
-            parent => $get_parent_image_name->($el->{parent}),
-            vmid => $owner,
+    # rbd ls --long exit code is unreliable: it reflects only the last image
+    # processed (last-wins), so stderr is the only reliable signal for
+    # per-image errors.
+    #
+    # When any errors were detected (non-zero exit or stderr), fall back to
+    # name-only listing which never opens images and always succeeds. If the
+    # name-only listing itself fails, re-propagate as a fatal error (pool not
+    # found, auth failure, etc.).
+    if ($ls_err || @errs) {
+        # e.g.: 2026-04-10T15:12:39.025+0200 77994ffff6c0 -1 librbd::image::OpenRequest: failed to retrieve initial metadata: (2) No such file or directory
+        # deduplicate in case multiple images fail the same way.
+        my %seen;
+        my @msgs;
+        for my $line (@errs) {
+            if ($line =~ /(librbd::.*?): \(/) {
+                push(@msgs, $1) if !$seen{$1}++;
+            }
+        }
+        my $details = @msgs ? ": " . join("; ", @msgs) : "";
+        warn "rbd ls --long had errors, checking for broken images$details\n";
+
+        my $names_raw = '';
+        my $names_cmd = $rbd_cmd->($scfg, $storeid, 'ls', '--format', 'json');
+        eval {
+            run_rbd_command(
+                $names_cmd,
+                errmsg => "rbd error",
+                errfunc => sub { },
+                outfunc => sub { $names_raw .= shift; },
+            );
         };
+        die $@ if $@;
+
+        my $all_names = [];
+        if ($names_raw =~ m/^(\[.*\])$/s) { # untaint
+            $all_names = eval { JSON::decode_json($1); };
+            die "invalid JSON output from 'rbd ls': $@\n" if $@;
+        }
+
+        for my $image (@$all_names) {
+            next if exists($list->{$image});
+
+            my ($owner) = $image =~ m/^(?:vm|base)-([0-9]+)-/;
+            next if !defined($owner);
+
+            warn "rbd image '$image' is corrupt or inaccessible\n";
+            $list->{$image} = {
+                name => $image,
+                size => -1,
+                parent => undef,
+                vmid => $owner,
+            };
+        }
     }
 
     return $list;
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH v3 manager 2/2] storage: content: show inaccessible RBD images in UI
  2026-06-29 11:48 [PATCH v3 manager/storage 0/2] rbd: handle corrupt or inaccessible images gracefully Kefu Chai
  2026-06-29 11:48 ` [PATCH v3 storage 1/2] fix #7000: " Kefu Chai
@ 2026-06-29 11:48 ` Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-06-29 11:48 UTC (permalink / raw)
  To: pve-devel

When a corrupt or inaccessible RBD image is returned with size=-1
(introduced in the pve-storage rbd_ls() fix for bug #7000), render
it as 'N/A (inaccessible)' instead of the nonsensical '-1 B'.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 www/manager6/storage/ContentView.js | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/www/manager6/storage/ContentView.js b/www/manager6/storage/ContentView.js
index c604cbab..7ff9d6e7 100644
--- a/www/manager6/storage/ContentView.js
+++ b/www/manager6/storage/ContentView.js
@@ -156,6 +156,9 @@ Ext.define(
                     width: 100,
                     dataIndex: 'size',
                     renderer: function (value, _meta, record) {
+                        if (value === -1) {
+                            return gettext('N/A (inaccessible)');
+                        }
                         if (value !== undefined) {
                             return Proxmox.Utils.format_size(value);
                         }
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-29 11:49 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-29 11:48 [PATCH v3 manager/storage 0/2] rbd: handle corrupt or inaccessible images gracefully Kefu Chai
2026-06-29 11:48 ` [PATCH v3 storage 1/2] fix #7000: " Kefu Chai
2026-06-29 11:48 ` [PATCH v3 manager 2/2] storage: content: show inaccessible RBD images in UI Kefu Chai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal