public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname
@ 2024-04-03 15:07 Hannes Duerr
  2024-06-11 13:25 ` Fiona Ebner
  0 siblings, 1 reply; 2+ messages in thread
From: Hannes Duerr @ 2024-04-03 15:07 UTC (permalink / raw)
  To: pve-devel

As Fabian has already mentioned here[0], there can be a race between two
parallel imports. More specifically, if both imports have --allow-rename
set and the desired name already exists, then it can happen that both
imports get the same name. The reason for this is that we currently only
check which names have already been assigned in the vm config and then
use the next free one, and do not check whether a running process has
already been assigned the same name. However, while writing and testing
the patch, I found that this is often not a problem, as
- in most cases the source VM config is locked and therefore only one
  process can generate a new name (migrate, clone, move disk, update
  config)
- a name must be specified and therefore no new name is generated (pvesm
  alloc)
- the timeframe for the race is very short.

At the same time, it is possible that I have not considered all edge
cases and that there are other cases where the race can occur,
especially with regard to remote_migrations. You can provoke the problem
with two parallel imports on a host where local-lvm:vm-100-disk-0
already exists:

pvesm import local-lvm:vm-100-disk-0 raw+size <exported_volume> --allow-rename 1

Now that I've looked into the problem a bit, I'm not sure this patch is
even necessary as it adds more complexity. So I wanted to ask for your
opinion, wether you think it makes sense to add this change or not.

The patch introduces a tmp file which stores the newly assigned disk
names and the pid of the process which requested the disk name. If a
second process is assigned the same name, it will see from the file that
the name has already been assigned to another process, and will take the
next available name. Reading and writing to the tmp file requires a lock
to prevent races.

[0] https://lists.proxmox.com/pipermail/pve-devel/2024-January/061526.html

Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
---
 src/PVE/Storage/Plugin.pm | 99 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 86 insertions(+), 13 deletions(-)

diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
index 7456c8e..f76550a 100644
--- a/src/PVE/Storage/Plugin.pm
+++ b/src/PVE/Storage/Plugin.pm
@@ -10,8 +10,10 @@ use File::chdir;
 use File::Path;
 use File::Basename;
 use File::stat qw();
+use File::Copy;
 
 use PVE::Tools qw(run_command);
+use PVE::ProcFSTools;
 use PVE::JSONSchema qw(get_standard_option register_standard_option);
 use PVE::Cluster qw(cfs_register_file);
 
@@ -779,23 +781,94 @@ my $get_vm_disk_number = sub {
 sub get_next_vm_diskname {
     my ($disk_list, $storeid, $vmid, $fmt, $scfg, $add_fmt_suffix) = @_;
 
-    $fmt //= '';
-    my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
-    my $suffix = $add_fmt_suffix ? ".$fmt" : '';
+    my $code = sub {
+	my $reserved_names_file = "/var/tmp/pve-reserved-volnames";
+	my $tmp_file = "/var/tmp/pve-reserved-volnames.tmp";
 
-    my $disk_ids = {};
-    foreach my $disk (@$disk_list) {
-	my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
-	$disk_ids->{$disknum} = 1 if defined($disknum);
-    }
+	$fmt //= '';
+	my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
+	my $suffix = $add_fmt_suffix ? ".$fmt" : '';
+	my $disk_id;
+	my $disk_ids = {};
 
-    for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
-	if (!$disk_ids->{$i}) {
-	    return "$prefix-$vmid-disk-$i$suffix";
+	foreach my $disk (@$disk_list) {
+	    my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
+	    $disk_ids->{$disknum} = 1 if defined($disknum);
 	}
-    }
 
-    die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
+	for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
+	    if (!$disk_ids->{$i}) {
+		$disk_id = $i;
+		last;
+	    }
+	}
+
+	if (! -e $reserved_names_file) {
+	    my $create_h = IO::File->new($reserved_names_file, "w") ||
+		die "can't open or create'$reserved_names_file' - $!\n";
+	    print $create_h "$storeid $vmid $disk_id $$";
+	    $create_h->close;
+
+	    return "$prefix-$vmid-disk-$disk_id$suffix";
+	} else {
+	    my $collision;
+	    my $pid;
+
+	    my $in_h = IO::File->new($reserved_names_file, "r") ||
+		die "can't open or create'$reserved_names_file' - $!\n";
+	    my $out_h = IO::File->new($tmp_file, "w") ||
+		die "can't open or create'$tmp_file' - $!\n";
+
+	    # remove entries when the process does not exist anymore
+	    while (my $line = <$in_h>) {
+		($pid) = $line =~ /$PVE::JSONSchema::PVE_ID \d+ \d+ (\d+)/;
+		next if (!defined $pid || !defined PVE::ProcFSTools::check_process_running($pid));
+		print $out_h $line;
+	    }
+
+	    $in_h->close;
+	    $out_h->close;
+	    move($tmp_file, $reserved_names_file);
+
+	    for (my $y = $disk_id; $y < $disk_id + 5; $y++) {
+		undef $collision;
+
+		my $input_h = IO::File->new($reserved_names_file, "r") ||
+		    die "can't open or create'$reserved_names_file' - $!\n";
+
+		while (my $line = <$input_h>) {
+		    if ($disk_ids->{$y} || $line =~ /$storeid $vmid $y \d+/) {
+			$collision=1;
+			last;
+		    }
+		}
+
+		# Found new free file name so we add it to the reserved file
+		if (!defined $collision) {
+		    my $append_h = IO::File->new($reserved_names_file, "a") ||
+			die "can't open or create'$reserved_names_file' - $!\n";
+
+		    print $append_h "$storeid $vmid $disk_id $$";
+		    $append_h->close;
+
+		    return "$prefix-$vmid-disk-$disk_id$suffix";
+		}
+
+		$input_h->close;
+	    }
+
+	    return;
+	}
+
+    };
+
+    my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
+    die $@ if $@;
+
+    if (!$new_disk_name) {
+	die "unable to allocate an image name for VM $vmid in storage '$storeid'\n";
+    }
+    return $new_disk_name;
 }
 
 sub find_free_diskname {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname
  2024-04-03 15:07 [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname Hannes Duerr
@ 2024-06-11 13:25 ` Fiona Ebner
  0 siblings, 0 replies; 2+ messages in thread
From: Fiona Ebner @ 2024-06-11 13:25 UTC (permalink / raw)
  To: Proxmox VE development discussion, Hannes Duerr

Do you mean "reservation" rather than "ref-counting"?

Am 03.04.24 um 17:07 schrieb Hannes Duerr:
> As Fabian has already mentioned here[0], there can be a race between two
> parallel imports. More specifically, if both imports have --allow-rename
> set and the desired name already exists, then it can happen that both
> imports get the same name. The reason for this is that we currently only
> check which names have already been assigned in the vm config and then
> use the next free one, and do not check whether a running process has
> already been assigned the same name. However, while writing and testing
> the patch, I found that this is often not a problem, as
> - in most cases the source VM config is locked and therefore only one
>   process can generate a new name (migrate, clone, move disk, update
>   config)
> - a name must be specified and therefore no new name is generated (pvesm
>   alloc)
> - the timeframe for the race is very short.
> 
> At the same time, it is possible that I have not considered all edge
> cases and that there are other cases where the race can occur,
> especially with regard to remote_migrations. You can provoke the problem
> with two parallel imports on a host where local-lvm:vm-100-disk-0
> already exists:
> 
> pvesm import local-lvm:vm-100-disk-0 raw+size <exported_volume> --allow-rename 1
> 

From a quick glance, it seems remote migration does one disk-import at a
time, i.e. only starts the next one once the previous one is finished.
In fact, even a malicious client could only do one import at a time,
because there only is one socket (per VM):

> 		    $params->{unix} = "/run/qemu-server/$state->{vmid}.storage";
> 
> 		    return PVE::StorageTunnel::handle_disk_import($state, $params);


> Now that I've looked into the problem a bit, I'm not sure this patch is
> even necessary as it adds more complexity. So I wanted to ask for your
> opinion, wether you think it makes sense to add this change or not.
> 

Please either use RFC for sending the patch then and/or put this
paragraph below the --- so it won't appear in the commit message.

IMHO, it would be nice to fix, but yes, it is a bunch of new code. I
think you could make it slightly simpler though, by doing the following
in the locked section:

1. iterate over reservations
-> if stale, remove
-> if not stale, add to $disk_list
2. get the next free name using the larger $disk_list
3. print the updated list of reservations including the new name

> The patch introduces a tmp file which stores the newly assigned disk
> names and the pid of the process which requested the disk name. If a
> second process is assigned the same name, it will see from the file that
> the name has already been assigned to another process, and will take the
> next available name. Reading and writing to the tmp file requires a lock
> to prevent races.
> 
> [0] https://lists.proxmox.com/pipermail/pve-devel/2024-January/061526.html
> 
> Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
> ---
>  src/PVE/Storage/Plugin.pm | 99 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 86 insertions(+), 13 deletions(-)
> 
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 7456c8e..f76550a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -10,8 +10,10 @@ use File::chdir;
>  use File::Path;
>  use File::Basename;
>  use File::stat qw();
> +use File::Copy;
>  

So this is used for the single move() below? Any reason not to use
file_set_contents() like in other places? You are using locking after
all and then you would also not need the tmp file.

>  use PVE::Tools qw(run_command);
> +use PVE::ProcFSTools;
>  use PVE::JSONSchema qw(get_standard_option register_standard_option);
>  use PVE::Cluster qw(cfs_register_file);
>  
> @@ -779,23 +781,94 @@ my $get_vm_disk_number = sub {
>  sub get_next_vm_diskname {
>      my ($disk_list, $storeid, $vmid, $fmt, $scfg, $add_fmt_suffix) = @_;
>  
> -    $fmt //= '';
> -    my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> -    my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> +    my $code = sub {
> +	my $reserved_names_file = "/var/tmp/pve-reserved-volnames";
> +	my $tmp_file = "/var/tmp/pve-reserved-volnames.tmp";
>  
> -    my $disk_ids = {};
> -    foreach my $disk (@$disk_list) {
> -	my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> -	$disk_ids->{$disknum} = 1 if defined($disknum);
> -    }
> +	$fmt //= '';
> +	my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> +	my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> +	my $disk_id;
> +	my $disk_ids = {};
>  
> -    for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> -	if (!$disk_ids->{$i}) {
> -	    return "$prefix-$vmid-disk-$i$suffix";
> +	foreach my $disk (@$disk_list) {
> +	    my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> +	    $disk_ids->{$disknum} = 1 if defined($disknum);
>  	}
> -    }
>  
> -    die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
> +	for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> +	    if (!$disk_ids->{$i}) {
> +		$disk_id = $i;
> +		last;
> +	    }
> +	}
> +
> +	if (! -e $reserved_names_file) {
> +	    my $create_h = IO::File->new($reserved_names_file, "w") ||
> +		die "can't open or create'$reserved_names_file' - $!\n";
> +	    print $create_h "$storeid $vmid $disk_id $$";

Not sure about using PIDs here. PIDs are re-used and can't this also be
the PID of some long-running daemon? Maybe we can use some expire time
like the next_unused_port() helper in PVE::Tools does? Can also be a few
minutes in this case, as we don't expect reservations to saturate.

[...]

> +    my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);

The lock file is local to the node, so this won't work for shared
storages. For those, you would need track the reservations in a file on
the cluster file system.

> +    die $@ if $@;
> +
> +    if (!$new_disk_name) {
> +	die "unable to allocate an image name for VM $vmid in storage '$storeid'\n";
> +    }
> +    return $new_disk_name;
>  }
>  
>  sub find_free_diskname {


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-06-11 13:25 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-03 15:07 [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname Hannes Duerr
2024-06-11 13:25 ` Fiona Ebner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal