all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH V3 pve-storage] lvm: use blkdiscard instead cstream to saferemove drive
@ 2025-08-18 12:08 Alexandre Derumier via pve-devel
  2025-09-22 14:06 ` Fiona Ebner
  0 siblings, 1 reply; 2+ messages in thread
From: Alexandre Derumier via pve-devel @ 2025-08-18 12:08 UTC (permalink / raw)
  To: pve-devel; +Cc: Alexandre Derumier

[-- Attachment #1: Type: message/rfc822, Size: 11103 bytes --]

From: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH V3 pve-storage] lvm: use blkdiscard instead cstream to saferemove drive
Date: Mon, 18 Aug 2025 14:08:54 +0200
Message-ID: <20250818120854.2668135-1-alexandre.derumier@groupe-cyllene.com>

Current cstream implementation is pretty slow as hell, even without throttling.

use blkdiscard --zeroout instead, which is lot of magnetude faster

Another benefit is that blkdiscard is skipping already zeroed block, so for empty
temp images like snapshot, is pretty fast.

blkdiscard don't have throttling like cstream, but we can tune the step size
of zeroes pushed to the storage.
I'm using 32MB stepsize by default , like ovirt, where it seem to be the best
balance between speed and load.
https://github.com/oVirt/vdsm/commit/79f1d79058aad863ca4b6672d4a5ce2be8e48986

but it can be reduce with "saferemove_stepsize" option.

stepsize is also autoreduce to sysfs write_zeroes_max_bytes, which is the maximum
zeroing batch supported by the storage

Also adding an option "saferemove_discard", to use discard instead zeroing.

test with a 100G volume (empty):

time /usr/bin/cstream -i /dev/zero -o /dev/test/vm-100-disk-0.qcow2 -T 10 -v 1 -b 1048576

13561233408 B 12.6 GB 10.00 s 1356062979 B/s 1.26 GB/s
26021462016 B 24.2 GB 20.00 s 1301029969 B/s 1.21 GB/s
38585499648 B 35.9 GB 30.00 s 1286135343 B/s 1.20 GB/s
50998542336 B 47.5 GB 40.00 s 1274925312 B/s 1.19 GB/s
63702765568 B 59.3 GB 50.00 s 1274009877 B/s 1.19 GB/s
76721885184 B 71.5 GB 60.00 s 1278640698 B/s 1.19 GB/s
89126539264 B 83.0 GB 70.00 s 1273178488 B/s 1.19 GB/s
101666459648 B 94.7 GB 80.00 s 1270779024 B/s 1.18 GB/s
107390959616 B 100.0 GB 84.39 s 1272531142 B/s 1.19 GB/s
write: No space left on device

real    1m24.394s
user    0m0.171s
sys     1m24.052s

time blkdiscard --zeroout /dev/test/vm-100-disk-0.qcow2 -v
/dev/test/vm-100-disk-0.qcow2: Zero-filled 107390959616 bytes from the offset 0

real    0m3.641s
user    0m0.001s
sys     0m3.433s

test with a 100G volume with random data:

time blkdiscard --zeroout /dev/test/vm-100-disk-0.qcow2 -v

/dev/test/vm-112-disk-1: Zero-filled 4764729344 bytes from the offset 0
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 4764729344
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 9428795392
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 14260633600
/dev/test/vm-112-disk-1: Zero-filled 4831838208 bytes from the offset 19092471808
/dev/test/vm-112-disk-1: Zero-filled 4865392640 bytes from the offset 23924310016
/dev/test/vm-112-disk-1: Zero-filled 4596957184 bytes from the offset 28789702656
/dev/test/vm-112-disk-1: Zero-filled 4731174912 bytes from the offset 33386659840
/dev/test/vm-112-disk-1: Zero-filled 4294967296 bytes from the offset 38117834752
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 42412802048
/dev/test/vm-112-disk-1: Zero-filled 4697620480 bytes from the offset 47076868096
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 51774488576
/dev/test/vm-112-disk-1: Zero-filled 4261412864 bytes from the offset 56438554624
/dev/test/vm-112-disk-1: Zero-filled 4362076160 bytes from the offset 60699967488
/dev/test/vm-112-disk-1: Zero-filled 4127195136 bytes from the offset 65062043648
/dev/test/vm-112-disk-1: Zero-filled 4328521728 bytes from the offset 69189238784
/dev/test/vm-112-disk-1: Zero-filled 4731174912 bytes from the offset 73517760512
/dev/test/vm-112-disk-1: Zero-filled 4026531840 bytes from the offset 78248935424
/dev/test/vm-112-disk-1: Zero-filled 4194304000 bytes from the offset 82275467264
/dev/test/vm-112-disk-1: Zero-filled 4664066048 bytes from the offset 86469771264
/dev/test/vm-112-disk-1: Zero-filled 4395630592 bytes from the offset 91133837312
/dev/test/vm-112-disk-1: Zero-filled 3623878656 bytes from the offset 95529467904
/dev/test/vm-112-disk-1: Zero-filled 4462739456 bytes from the offset 99153346560
/dev/test/vm-112-disk-1: Zero-filled 3758096384 bytes from the offset 103616086016

real    0m23.969s
user    0m0.030s
sys     0m0.144s

Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
---
 src/PVE/Storage/LVMPlugin.pm | 74 +++++++++++++++++++++++-------------
 1 file changed, 48 insertions(+), 26 deletions(-)

diff --git a/src/PVE/Storage/LVMPlugin.pm b/src/PVE/Storage/LVMPlugin.pm
index 0416c9e..c3a864e 100644
--- a/src/PVE/Storage/LVMPlugin.pm
+++ b/src/PVE/Storage/LVMPlugin.pm
@@ -6,7 +6,7 @@ use warnings;
 use File::Basename;
 use IO::File;
 
-use PVE::Tools qw(run_command trim);
+use PVE::Tools qw(run_command file_read_firstline trim);
 use PVE::Storage::Plugin;
 use PVE::JSONSchema qw(get_standard_option);
 
@@ -287,35 +287,46 @@ my sub free_lvm_volumes {
     # we need to zero out LVM data for security reasons
     # and to allow thin provisioning
     my $zero_out_worker = sub {
-        # wipe throughput up to 10MB/s by default; may be overwritten with saferemove_throughput
-        my $throughput = '-10485760';
-        if ($scfg->{saferemove_throughput}) {
-            $throughput = $scfg->{saferemove_throughput};
-        }
         for my $name (@$volnames) {
-            print "zero-out data on image $name (/dev/$vg/del-$name)\n";
+            my $lvmpath = "/dev/$vg/del-$name";
+            print "zero-out data on image $name ($lvmpath)\n";
+
+            warn
+                "The 'saferemove_throughput' option set for '${storeid}' is deprecated and will be removed."
+                if $scfg->{saferemove_throughput};
+
+            my $stepsize = $scfg->{saferemove_stepsize} // 32;
+            $stepsize = $stepsize * 1048576;
+
+            my $bdev = readlink($lvmpath);
+            if ($bdev && $bdev =~ m|^../(dm-(\d+))|) {
+                my $sysdir = "/sys/block/$1";
+
+                my $write_zeroes_max_bytes =
+                    file_read_firstline("$sysdir/queue/write_zeroes_max_bytes") // 0;
+                ($write_zeroes_max_bytes) = $write_zeroes_max_bytes =~ m/^(\d+)$/; #untaint
+
+                #if the storage support write_zeroes but stepsize is too big, reduce the stepsize to the max possible
+                if ($write_zeroes_max_bytes > 0 && $stepsize > $write_zeroes_max_bytes) {
+                    warn
+                        "reduce stepsize to the maximum supported by the storage : $write_zeroes_max_bytes bytes\n";
+                    $stepsize = $write_zeroes_max_bytes;
+                }
+
+                my $discard_zeroes_data =
+                    file_read_firstline("$sysdir/queue/discard_zeroes_data") // 0;
+                if ($scfg->{saferemove_discard} && $discard_zeroes_data == 0) {
+                    warn "disabling saferemove_discard as the storage don't support it\n";
+                    $scfg->{saferemove_discard} = undef;
+                }
+            }
 
             my $cmd = [
-                '/usr/bin/cstream',
-                '-i',
-                '/dev/zero',
-                '-o',
-                "/dev/$vg/del-$name",
-                '-T',
-                '10',
-                '-v',
-                '1',
-                '-b',
-                '1048576',
-                '-t',
-                "$throughput",
+                '/usr/sbin/blkdiscard', $lvmpath, '-v', '--step', "${stepsize}",
             ];
-            eval {
-                run_command(
-                    $cmd,
-                    errmsg => "zero out finished (note: 'No space left on device' is ok here)",
-                );
-            };
+            push @$cmd, '--zeroout' if !$scfg->{saferemove_discard};
+
+            eval { run_command($cmd); };
             warn $@ if $@;
 
             $class->cluster_lock_storage(
@@ -376,6 +387,15 @@ sub properties {
             description => "Zero-out data when removing LVs.",
             type => 'boolean',
         },
+        saferemove_discard => {
+            description => "Wipe with discard instead zeroing.",
+            type => 'boolean',
+            default => 0,
+        },
+        saferemove_stepsize => {
+            description => "Wipe step size (default 32MB).",
+            enum => [qw(1 2 4 8 16 32)],
+        },
         saferemove_throughput => {
             description => "Wipe throughput (cstream -t parameter value).",
             type => 'string',
@@ -394,6 +414,8 @@ sub options {
         shared => { optional => 1 },
         disable => { optional => 1 },
         saferemove => { optional => 1 },
+        saferemove_discard => { optional => 1 },
+        saferemove_stepsize => { optional => 1 },
         saferemove_throughput => { optional => 1 },
         content => { optional => 1 },
         base => { fixed => 1, optional => 1 },
-- 
2.47.2



[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [pve-devel] [PATCH V3 pve-storage] lvm: use blkdiscard instead cstream to saferemove drive
  2025-08-18 12:08 [pve-devel] [PATCH V3 pve-storage] lvm: use blkdiscard instead cstream to saferemove drive Alexandre Derumier via pve-devel
@ 2025-09-22 14:06 ` Fiona Ebner
  0 siblings, 0 replies; 2+ messages in thread
From: Fiona Ebner @ 2025-09-22 14:06 UTC (permalink / raw)
  To: Proxmox VE development discussion

Am 18.08.25 um 2:09 PM schrieb Alexandre Derumier via pve-devel:
> Current cstream implementation is pretty slow as hell, even without throttling.

I'd use either "pretty slow" or "slow as hell"

> use blkdiscard --zeroout instead, which is lot of magnetude faster

should be: "a few magnitudes"

Seems like nice-to-have options given the improvement :)

Mostly nits, but also a question about REQ_OP_WRITE_ZEROES below:

> Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
> ---
>  src/PVE/Storage/LVMPlugin.pm | 74 +++++++++++++++++++++++-------------
>  1 file changed, 48 insertions(+), 26 deletions(-)
> 
> diff --git a/src/PVE/Storage/LVMPlugin.pm b/src/PVE/Storage/LVMPlugin.pm
> index 0416c9e..c3a864e 100644
> --- a/src/PVE/Storage/LVMPlugin.pm
> +++ b/src/PVE/Storage/LVMPlugin.pm
> @@ -6,7 +6,7 @@ use warnings;
>  use File::Basename;
>  use IO::File;
>  
> -use PVE::Tools qw(run_command trim);
> +use PVE::Tools qw(run_command file_read_firstline trim);
>  use PVE::Storage::Plugin;
>  use PVE::JSONSchema qw(get_standard_option);
>  
> @@ -287,35 +287,46 @@ my sub free_lvm_volumes {
>      # we need to zero out LVM data for security reasons
>      # and to allow thin provisioning
>      my $zero_out_worker = sub {
> -        # wipe throughput up to 10MB/s by default; may be overwritten with saferemove_throughput
> -        my $throughput = '-10485760';
> -        if ($scfg->{saferemove_throughput}) {
> -            $throughput = $scfg->{saferemove_throughput};
> -        }
>          for my $name (@$volnames) {
> -            print "zero-out data on image $name (/dev/$vg/del-$name)\n";
> +            my $lvmpath = "/dev/$vg/del-$name";
> +            print "zero-out data on image $name ($lvmpath)\n";
> +
> +            warn
> +                "The 'saferemove_throughput' option set for '${storeid}' is deprecated and will be removed."
> +                if $scfg->{saferemove_throughput};

Please split the string up (too long right now) and add a newline at the
end. It should be an if block rather than post if, since it's multiple
lines.

I'd also go for "and has no effect anymore." rather than "and will be
removed". Would be good to add a reminder comment in the code to remove
the option with PVE 10 (and then, for PVE 10, we can warn/error in the
pve9to10 script if the setting is still present).

> +
> +            my $stepsize = $scfg->{saferemove_stepsize} // 32;
> +            $stepsize = $stepsize * 1048576;

Minor nit, but "$stepsize * 1024 * 1024" would be more readable.

> +
> +            my $bdev = readlink($lvmpath);

I'd prefer abs_path() like we use in other places.

> +            if ($bdev && $bdev =~ m|^../(dm-(\d+))|) {

You don't need to group/capture the \d+

> +                my $sysdir = "/sys/block/$1";
> +
> +                my $write_zeroes_max_bytes =
> +                    file_read_firstline("$sysdir/queue/write_zeroes_max_bytes") // 0;

According to the docs: "The value 0 means that REQ_OP_WRITE_ZEROES
is not supported.", so it makes sense as a fallback too. But do we need
to handle that somehow below, i.e. can --zeroout still be used if the
reported value is 0? Or do we need to fall back to the old way of doing it?

> +                ($write_zeroes_max_bytes) = $write_zeroes_max_bytes =~ m/^(\d+)$/; #untaint
> +
> +                #if the storage support write_zeroes but stepsize is too big, reduce the stepsize to the max possible
> +                if ($write_zeroes_max_bytes > 0 && $stepsize > $write_zeroes_max_bytes) {
> +                    warn
> +                        "reduce stepsize to the maximum supported by the storage : $write_zeroes_max_bytes bytes\n";

Style nit: line too long

> +                    $stepsize = $write_zeroes_max_bytes;
> +                }
> +
> +                my $discard_zeroes_data =
> +                    file_read_firstline("$sysdir/queue/discard_zeroes_data") // 0;
> +                if ($scfg->{saferemove_discard} && $discard_zeroes_data == 0) {
> +                    warn "disabling saferemove_discard as the storage don't support it\n";
> +                    $scfg->{saferemove_discard} = undef;

This should be done with a local variable instead of changing the $scfg
hash.

> +                }
> +            }
>  
>              my $cmd = [
> -                '/usr/bin/cstream',
> -                '-i',
> -                '/dev/zero',
> -                '-o',
> -                "/dev/$vg/del-$name",
> -                '-T',
> -                '10',
> -                '-v',
> -                '1',
> -                '-b',
> -                '1048576',
> -                '-t',
> -                "$throughput",
> +                '/usr/sbin/blkdiscard', $lvmpath, '-v', '--step', "${stepsize}",
>              ];

Style nit: could now be one line

> -            eval {
> -                run_command(
> -                    $cmd,
> -                    errmsg => "zero out finished (note: 'No space left on device' is ok here)",
> -                );
> -            };
> +            push @$cmd, '--zeroout' if !$scfg->{saferemove_discard};
> +
> +            eval { run_command($cmd); };
>              warn $@ if $@;
>  
>              $class->cluster_lock_storage(
> @@ -376,6 +387,15 @@ sub properties {
>              description => "Zero-out data when removing LVs.",
>              type => 'boolean',
>          },
> +        saferemove_discard => {
> +            description => "Wipe with discard instead zeroing.",

Should be: "instead of zeroing"

> +            type => 'boolean',
> +            default => 0,
> +        },
> +        saferemove_stepsize => {
> +            description => "Wipe step size (default 32MB).",

Could mention that it will be capped to what the storage supports

> +            enum => [qw(1 2 4 8 16 32)],

Missing type

> +        },
>          saferemove_throughput => {
>              description => "Wipe throughput (cstream -t parameter value).",
>              type => 'string',

Please adapt the description to mention that it is deprecated and a
reminder comment in the code to remove it with PVE 10.

> @@ -394,6 +414,8 @@ sub options {
>          shared => { optional => 1 },
>          disable => { optional => 1 },
>          saferemove => { optional => 1 },
> +        saferemove_discard => { optional => 1 },
> +        saferemove_stepsize => { optional => 1 },

For new options, we try to use kebab-case rather than snake_case

>          saferemove_throughput => { optional => 1 },
>          content => { optional => 1 },
>          base => { fixed => 1, optional => 1 },

I wonder if we should turn saferemove into a property string with
enabled=BOOLEAN,mode=(zeroout|discard),stepsize=(1|2|4|8|16|32) and
enabled being the default key for backwards compat. But I'm also fine
with having the two properties.


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2025-09-22 14:06 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-18 12:08 [pve-devel] [PATCH V3 pve-storage] lvm: use blkdiscard instead cstream to saferemove drive Alexandre Derumier via pve-devel
2025-09-22 14:06 ` Fiona Ebner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal