From: Fiona Ebner <f.ebner@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
Hannes Duerr <h.duerr@proxmox.com>
Subject: Re: [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname
Date: Tue, 11 Jun 2024 15:25:27 +0200
Message-ID: <283330a0-658c-4b09-9242-79dc941c0fa3@proxmox.com>
In-Reply-To: <20240403150712.262773-1-h.duerr@proxmox.com>
Do you mean "reservation" rather than "ref-counting"?
On 03.04.24 at 17:07, Hannes Duerr wrote:
> As Fabian has already mentioned here[0], there can be a race between two
> parallel imports. More specifically, if both imports have --allow-rename
> set and the desired name already exists, then it can happen that both
> imports get the same name. The reason for this is that we currently only
> check which names have already been assigned in the vm config and then
> use the next free one, and do not check whether a running process has
> already been assigned the same name. However, while writing and testing
> the patch, I found that this is often not a problem, as
> - in most cases the source VM config is locked and therefore only one
> process can generate a new name (migrate, clone, move disk, update
> config)
> - a name must be specified and therefore no new name is generated (pvesm
> alloc)
> - the timeframe for the race is very short.
>
> At the same time, it is possible that I have not considered all edge
> cases and that there are other cases where the race can occur,
> especially with regard to remote_migrations. You can provoke the problem
> with two parallel imports on a host where local-lvm:vm-100-disk-0
> already exists:
>
> pvesm import local-lvm:vm-100-disk-0 raw+size <exported_volume> --allow-rename 1
>
From a quick glance, it seems remote migration does one disk-import at a
time, i.e. only starts the next one once the previous one is finished.
In fact, even a malicious client could only do one import at a time,
because there is only one socket (per VM):
> $params->{unix} = "/run/qemu-server/$state->{vmid}.storage";
>
> return PVE::StorageTunnel::handle_disk_import($state, $params);
> Now that I've looked into the problem a bit, I'm not sure this patch is
> even necessary as it adds more complexity. So I wanted to ask for your
> opinion, whether you think it makes sense to add this change or not.
>
Please either mark the patch as RFC when sending it and/or put this
paragraph below the --- so it won't appear in the commit message.
IMHO, it would be nice to fix, but yes, it is a bunch of new code. I
think you could make it slightly simpler though, by doing the following
in the locked section (rough sketch after the list):
1. iterate over reservations
-> if stale, remove
-> if not stale, add to $disk_list
2. get the next free name using the larger $disk_list
3. print the updated list of reservations including the new name
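Untested sketch of what I mean - the reservation line format
"$storeid $vmid $volname $timestamp" and the 5-minute expiry are just
examples, not a finished design; $disk_list, $prefix, $suffix etc. are
the ones already set up in get_next_vm_diskname():

    my $reservation_file = "/var/tmp/pve-reserved-volnames";
    my $expiry = 5 * 60; # seconds, generous - reservations are short-lived

    my $code = sub {
        my $now = time();
        my $kept = [];

        # 1. iterate over reservations: drop stale ones, treat valid ones as taken
        my $raw = -e $reservation_file
            ? PVE::Tools::file_get_contents($reservation_file) : '';
        for my $line (split(/\n/, $raw)) {
            my ($r_storeid, $r_vmid, $r_volname, $r_time) = split(/\s+/, $line);
            next if !defined($r_time) || $now - $r_time > $expiry; # stale
            push @$kept, $line;
            push @$disk_list, $r_volname if $r_storeid eq $storeid;
        }

        # 2. get the next free name using the larger $disk_list
        my $disk_ids = {};
        for my $disk (@$disk_list) {
            my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
            $disk_ids->{$disknum} = 1 if defined($disknum);
        }
        my $name;
        for my $i (0 .. $MAX_VOLUMES_PER_GUEST - 1) {
            if (!$disk_ids->{$i}) {
                $name = "$prefix-$vmid-disk-$i$suffix";
                last;
            }
        }
        die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
            if !defined($name);

        # 3. write back the still-valid reservations including the new name
        push @$kept, "$storeid $vmid $name $now";
        PVE::Tools::file_set_contents($reservation_file, join("\n", @$kept) . "\n");

        return $name;
    };

    my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
    die $@ if $@;
    return $new_disk_name;

That way, there is only one file to manage and no PIDs to worry about.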
> The patch introduces a tmp file which stores the newly assigned disk
> names and the pid of the process which requested the disk name. If a
> second process is assigned the same name, it will see from the file that
> the name has already been assigned to another process, and will take the
> next available name. Reading and writing to the tmp file requires a lock
> to prevent races.
>
> [0] https://lists.proxmox.com/pipermail/pve-devel/2024-January/061526.html
>
> Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
> ---
> src/PVE/Storage/Plugin.pm | 99 ++++++++++++++++++++++++++++++++++-----
> 1 file changed, 86 insertions(+), 13 deletions(-)
>
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 7456c8e..f76550a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -10,8 +10,10 @@ use File::chdir;
> use File::Path;
> use File::Basename;
> use File::stat qw();
> +use File::Copy;
>
So this is used for the single move() below? Any reason not to use
file_set_contents() like in other places? You are using locking after
all, so you would not need the tmp file either.
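For reference - if I remember correctly, file_set_contents() already
writes to a temporary file internally and then renames it into place,
so something like this (sketch, with $new_content being the updated
reservation list) should be enough:

    # atomic replace, no manual tmp file + move() needed
    PVE::Tools::file_set_contents($reserved_names_file, $new_content);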
> use PVE::Tools qw(run_command);
> +use PVE::ProcFSTools;
> use PVE::JSONSchema qw(get_standard_option register_standard_option);
> use PVE::Cluster qw(cfs_register_file);
>
> @@ -779,23 +781,94 @@ my $get_vm_disk_number = sub {
> sub get_next_vm_diskname {
> my ($disk_list, $storeid, $vmid, $fmt, $scfg, $add_fmt_suffix) = @_;
>
> - $fmt //= '';
> - my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> - my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $code = sub {
> + my $reserved_names_file = "/var/tmp/pve-reserved-volnames";
> + my $tmp_file = "/var/tmp/pve-reserved-volnames.tmp";
>
> - my $disk_ids = {};
> - foreach my $disk (@$disk_list) {
> - my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> - $disk_ids->{$disknum} = 1 if defined($disknum);
> - }
> + $fmt //= '';
> + my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> + my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $disk_id;
> + my $disk_ids = {};
>
> - for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> - if (!$disk_ids->{$i}) {
> - return "$prefix-$vmid-disk-$i$suffix";
> + foreach my $disk (@$disk_list) {
> + my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> + $disk_ids->{$disknum} = 1 if defined($disknum);
> }
> - }
>
> - die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
> + for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> + if (!$disk_ids->{$i}) {
> + $disk_id = $i;
> + last;
> + }
> + }
> +
> + if (! -e $reserved_names_file) {
> + my $create_h = IO::File->new($reserved_names_file, "w") ||
> + die "can't open or create'$reserved_names_file' - $!\n";
> + print $create_h "$storeid $vmid $disk_id $$";
Not sure about using PIDs here. PIDs are re-used, and couldn't this also
be the PID of some long-running daemon? Maybe we can use an expire time
like the next_unused_port() helper in PVE::Tools does? It can also be a
few minutes in this case, as we don't expect reservations to saturate.
[...]
> + my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
The lock file is local to the node, so this won't work for shared
storages. For those, you would need to track the reservations in a file
on the cluster file system.
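A rough sketch of the direction (cluster_lock_storage() already exists
in Plugin.pm and takes a cluster-wide lock for shared storages; the
pmxcfs path below is purely hypothetical):

    # sketch: dispatch on whether the storage is shared - cluster_lock_storage()
    # in Plugin.pm already does exactly this kind of switch for the locking
    my $reservation_file = $scfg->{shared}
        ? "/etc/pve/priv/reserved-volnames"   # hypothetical path on pmxcfs
        : "/var/tmp/pve-reserved-volnames";   # node-local, as in the patch

    my $new_disk_name = __PACKAGE__->cluster_lock_storage(
        $storeid, $scfg->{shared}, 10, $code);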
> + die $@ if $@;
> +
> + if (!$new_disk_name) {
> + die "unable to allocate an image name for VM $vmid in storage '$storeid'\n";
> + }
> + return $new_disk_name;
> }
>
> sub find_free_diskname {