From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <edd3bab8-8474-2085-45b0-b5a78d51b6e5@proxmox.com>
Date: Mon, 21 Aug 2023 17:05:37 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.14.0
Content-Language: en-US
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 Aaron Lauterer <a.lauterer@proxmox.com>
References: <20230614111022.1432946-1-a.lauterer@proxmox.com>
From: Fiona Ebner <f.ebner@proxmox.com>
In-Reply-To: <20230614111022.1432946-1-a.lauterer@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [pve-devel] [PATCH v2 storage 1/2] rbd: improve handling of
 missing images

On 14.06.23 at 13:10, Aaron Lauterer wrote:
> It can happen that an RBD image isn't cleaned up completely. Calling
> 'rbd ls -l' will then show errors saying that the image in question
> cannot be opened:
> ```
> rbd: error opening vm-103-disk-1: (2) No such file or directory
> rbd: listing images failed: (2) No such file or directory
> ```
> 
> Previously, we only showed the last error line, which is too generic and
> doesn't give a good hint about what is actually wrong.
> 
> We can improve that by catching these specific errors and adding the
> problematic disk images to the returned list with a size of '-1'.
> 
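
The hunks below match against `$missing_image_err_regex`, whose definition is not part of the quoted diff. A minimal sketch of what such a pattern could look like, assuming it only needs to match the per-image error line shown above and capture the image name in `$1`; the exact pattern in the patch may differ:

```perl
use strict;
use warnings;

# Hypothetical pattern: matches the per-image error line printed by
# 'rbd ls -l' and captures the image name in $1.
my $missing_image_err_regex =
    qr/^rbd: error opening (\S+): \(2\) No such file or directory$/;

my @stderr = (
    'rbd: error opening vm-103-disk-1: (2) No such file or directory',
    'rbd: listing images failed: (2) No such file or directory',
);

for my $line (@stderr) {
    # only the first line matches; the generic trailing error does not
    if ($line =~ m/$missing_image_err_regex/) {
        print "missing image: $1\n";
    }
}
```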

What do you think about logging a warning instead, hinting that it might
be a partially removed image? What I'm a bit worried about is that
existing scripts/tools interacting with our API might get confused by a
size of -1. In the UI, I don't see it with either approach, because your
next patch hides it. On the CLI, I'll see either the warning or the -1,
depending on the approach.
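
For comparison, the warning-based alternative could look roughly like the following. It is only a sketch: `warn_missing_images` is a hypothetical helper and the message text is made up; it assumes `$missing_images` is the hash collected by the error parser in the patch, keyed by image name:

```perl
use strict;
use warnings;

# Hypothetical helper: instead of adding entries with size -1 to the
# result of rbd_ls, emit one warning per image that could not be opened.
sub warn_missing_images {
    my ($missing_images) = @_;
    my @msgs;
    for my $image (sort keys %$missing_images) {
        my $msg = "rbd: could not open image '$image', it might be a partially removed image\n";
        warn $msg;
        push @msgs, $msg;
    }
    return \@msgs; # returned only so callers can inspect what was logged
}

warn_missing_images({ 'vm-103-disk-1' => 1 });
```

The listing itself then stays unchanged for API consumers, while CLI users still get a hint about the broken image.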

> @@ -207,13 +209,28 @@ sub rbd_ls {
>      my $raw = '';
>      my $parser = sub { $raw .= shift };
>  
> +    my $show_err = 1;
> +    my $missing_images = {};
> +    my $err_parser = sub {
> +	my $line = shift;
> +	if ($line =~ m/$missing_image_err_regex/) {
> +	    $show_err = 0;

Both might be edge cases, but: what if there was some other error
before this one that we should die on? Or what if another error happens
in such a way that no further stderr line is logged? Then $show_err
will still be 0 below and the function won't die.

It might be slightly better to:
1. die if there was any stderr log line we don't want to ignore
2. if there was none, base the decision on whether the final log line
was the "rbd: listing images failed: (2) No such file or directory"
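
The two rules above can be sketched as a decision made after the command has finished, assuming the error function only records stderr lines instead of dying. `should_die_ls` and `$generic_err` are hypothetical names, and the regex is the illustrative one, not necessarily the patch's:

```perl
use strict;
use warnings;

my $missing_image_err_regex =
    qr/^rbd: error opening (\S+): \(2\) No such file or directory$/;
my $generic_err = 'rbd: listing images failed: (2) No such file or directory';

# Decide whether rbd_ls should die, given the error from running the
# command ($err, undef on success) and all stderr lines in order.
sub should_die_ls {
    my ($err, @lines) = @_;
    return 0 if !$err; # command succeeded

    # 1. any stderr line we do not explicitly tolerate -> die
    for my $line (@lines) {
        next if $line =~ m/$missing_image_err_regex/;
        next if $line eq $generic_err;
        return 1;
    }

    # 2. only tolerated lines (or none at all): ignore the failure only
    #    if the final line was the generic "listing images failed" one
    return 1 if !@lines;
    return $lines[-1] eq $generic_err ? 0 : 1;
}
```

Collecting first and deciding afterwards means an earlier unexpected line can no longer be "reset" by a later, ignorable one.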

> +	    $missing_images->{$1} = 1;
> +	} elsif ($line ne "rbd: listing images failed: (2) No such file or directory") {
> +	    # this generic error is shown after the image specific "No such file..." one,
> +	    # ignore it but not other errors
> +	    $show_err = 1;
> +	    die $line;
> +	}
> +    };
> +
>      my $cmd = $rbd_cmd->($scfg, $storeid, 'ls', '-l', '--format', 'json');
>      eval {
> -	run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub {}, outfunc => $parser);
> +	run_rbd_command($cmd, errmsg => "rbd error", errfunc => $err_parser, outfunc => $parser);
>      };
>      my $err = $@;
>  
> -    die $err if $err && $err !~ m/doesn't contain rbd images/ ;
> +    die $err if $err && $show_err && $err !~ m/doesn't contain rbd images/ ;
>  
The "doesn't contain rbd images" bit could also be added to the
err_parser() :)
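
One way to fold that check into the error parser is to classify each stderr line, so the caller no longer needs to inspect `$err` for it. A sketch with a hypothetical `is_tolerated_ls_line` helper; the strings mirror the quoted diff, and whether the "doesn't contain rbd images" text actually arrives on stderr for all Ceph versions is an assumption that would need checking:

```perl
use strict;
use warnings;

my $missing_image_err_regex =
    qr/^rbd: error opening (\S+): \(2\) No such file or directory$/;

# Hypothetical helper: returns true for stderr lines rbd_ls may ignore.
sub is_tolerated_ls_line {
    my ($line) = @_;
    return 1 if $line =~ m/$missing_image_err_regex/;
    return 1 if $line eq 'rbd: listing images failed: (2) No such file or directory';
    return 1 if $line =~ m/doesn't contain rbd images/;
    return 0;
}

# Error parser in the style of the patch: remember missing images and
# collect all other unexpected lines for a decision after the command.
my $missing_images = {};
my @fatal_lines;
my $err_parser = sub {
    my ($line) = @_;
    if ($line =~ m/$missing_image_err_regex/) {
        $missing_images->{$1} = 1;
    } elsif (!is_tolerated_ls_line($line)) {
        push @fatal_lines, $line;
    }
};
```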

>      my $result;
>      if ($raw eq '') {
> @@ -224,6 +241,13 @@ sub rbd_ls {
>  	die "got unexpected data from rbd ls: '$raw'\n";
>      }
>  
> +    for my $image (keys %$missing_images) {
> +	push @$result, {
> +	    image => $image,
> +	    size => -1,
> +	};
> +    }
> +
>      my $list = {};
>  
>      foreach my $el (@$result) {
> @@ -251,7 +275,20 @@ sub rbd_ls_snap {
>      my $cmd = $rbd_cmd->($scfg, $storeid, 'snap', 'ls', $name, '--format', 'json');
>  
>      my $raw = '';
> -    run_rbd_command($cmd, errmsg => "rbd error", errfunc => sub {}, outfunc => sub { $raw .= shift; });
> +    my $show_err = 0;

Similar to the above, but I think this can happen more easily: what if
there is no stderr log line, but the command fails?

Slightly better:
1. if we got no log lines at all, but the command failed, die
2. if there was any stderr log line we don't want to ignore, also die
3. if we only got log lines we want to ignore, don't die
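
The three rules for rbd_ls_snap can be sketched as a small function over the recorded stderr lines. `snap_ls_outcome` and its return values are hypothetical, and it reuses the illustrative regex rather than the patch's actual one:

```perl
use strict;
use warnings;

my $missing_image_err_regex =
    qr/^rbd: error opening (\S+): \(2\) No such file or directory$/;

# Hypothetical helper for rbd_ls_snap: given the command error ($err,
# undef on success) and all stderr lines, return one of
#   'ok'      - command succeeded, parse the JSON output as usual
#   'missing' - only "image missing" lines, return an empty snapshot list
#   'die'     - anything else, propagate the error
sub snap_ls_outcome {
    my ($err, @lines) = @_;
    return 'ok' if !$err;
    return 'die' if !@lines;                # rule 1: failed without log lines
    for my $line (@lines) {                 # rule 2: any unexpected line
        return 'die' if $line !~ m/$missing_image_err_regex/;
    }
    return 'missing';                       # rule 3: only ignorable lines
}
```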

> +    my $err_parser = sub {
> +	my $line = shift;
> +	if ($line !~ m/$missing_image_err_regex/) {
> +	    $show_err = 1;
> +	    die $line;
> +	}
> +    };
> +    eval {
> +	run_rbd_command($cmd, errmsg => "rbd error", errfunc => $err_parser, outfunc => sub { $raw .= shift; });
> +    };
> +    my $err = $@;
> +    die $err if $err && $show_err;
> +    return {} if $err && !$show_err; # could not open image, probably missing
>  
>      my $list;
>      if ($raw =~ m/^(\[.*\])$/s) { # untaint
> @@ -633,10 +670,13 @@ sub free_image {
>  
>      $class->deactivate_volume($storeid, $scfg, $volname);
>  
> -    my $cmd = $rbd_cmd->($scfg, $storeid, 'snap', 'purge',  $name);
> -    run_rbd_command($cmd, errmsg => "rbd snap purge '$name' error");
>  
> -    $cmd = $rbd_cmd->($scfg, $storeid, 'rm', $name);
> +    if (keys %{$snaps}) {
> +	my $cmd = $rbd_cmd->($scfg, $storeid, 'snap', 'purge',  $name);
> +	run_rbd_command($cmd, errmsg => "rbd snap purge '$name' error");
> +    }
> +
> +    my $cmd = $rbd_cmd->($scfg, $storeid, 'rm', $name);
>      run_rbd_command($cmd, errmsg => "rbd rm '$name' error");
>  
>      return undef;

Does the 'snap purge' command also fail on such a partially removed
image? If that was the motivation for this change, please mention it in
the commit message. Otherwise, it can be its own patch ;)