From: Fiona Ebner <f.ebner@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
Hannes Duerr <h.duerr@proxmox.com>
Subject: Re: [pve-devel] [PATCH pve-storage 1/1] storage/plugin: implement ref-counting for disknames in get_next_vm_diskname
Date: Tue, 11 Jun 2024 15:25:27 +0200
Message-ID: <283330a0-658c-4b09-9242-79dc941c0fa3@proxmox.com>
In-Reply-To: <20240403150712.262773-1-h.duerr@proxmox.com>
Do you mean "reservation" rather than "ref-counting"?
On 03.04.24 at 17:07, Hannes Duerr wrote:
> As Fabian has already mentioned here[0], there can be a race between two
> parallel imports. More specifically, if both imports have --allow-rename
> set and the desired name already exists, then it can happen that both
> imports get the same name. The reason for this is that we currently only
> check which names have already been assigned in the vm config and then
> use the next free one, and do not check whether a running process has
> already been assigned the same name. However, while writing and testing
> the patch, I found that this is often not a problem, as
> - in most cases the source VM config is locked and therefore only one
> process can generate a new name (migrate, clone, move disk, update
> config)
> - a name must be specified and therefore no new name is generated (pvesm
> alloc)
> - the timeframe for the race is very short.
>
> At the same time, it is possible that I have not considered all edge
> cases and that there are other cases where the race can occur,
> especially with regard to remote_migrations. You can provoke the problem
> with two parallel imports on a host where local-lvm:vm-100-disk-0
> already exists:
>
> pvesm import local-lvm:vm-100-disk-0 raw+size <exported_volume> --allow-rename 1
>
From a quick glance, it seems remote migration does one disk-import at a
time, i.e. only starts the next one once the previous one is finished.
In fact, even a malicious client could only do one import at a time,
because there is only one socket (per VM):
> $params->{unix} = "/run/qemu-server/$state->{vmid}.storage";
>
> return PVE::StorageTunnel::handle_disk_import($state, $params);
> Now that I've looked into the problem a bit, I'm not sure this patch is
> even necessary as it adds more complexity. So I wanted to ask for your
> opinion, whether you think it makes sense to add this change or not.
>
Please either mark the patch as RFC when sending it and/or put this
paragraph below the --- so it won't appear in the commit message.
IMHO, it would be nice to fix, but yes, it is a bunch of new code. I
think you could make it slightly simpler though, by doing the following
in the locked section (rough sketch after the list):
1. iterate over reservations
-> if stale, remove
-> if not stale, add to $disk_list
2. get the next free name using the larger $disk_list
3. print the updated list of reservations including the new name
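Untested sketch of what I mean - the reservation line format
"$storeid $vmid $volname $timestamp" and the 5-minute expiry are just
examples, not a finished design; $disk_list, $prefix, $suffix etc. are
the ones already set up in get_next_vm_diskname():

    my $reservation_file = "/var/tmp/pve-reserved-volnames";
    my $expiry = 5 * 60; # seconds, generous - reservations are short-lived

    my $code = sub {
        my $now = time();
        my $kept = [];

        # 1. iterate over reservations: drop stale ones, treat valid ones as taken
        my $raw = -e $reservation_file
            ? PVE::Tools::file_get_contents($reservation_file) : '';
        for my $line (split(/\n/, $raw)) {
            my ($r_storeid, $r_vmid, $r_volname, $r_time) = split(/\s+/, $line);
            next if !defined($r_time) || $now - $r_time > $expiry; # stale
            push @$kept, $line;
            push @$disk_list, $r_volname if $r_storeid eq $storeid;
        }

        # 2. get the next free name using the larger $disk_list
        my $disk_ids = {};
        for my $disk (@$disk_list) {
            my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
            $disk_ids->{$disknum} = 1 if defined($disknum);
        }
        my $name;
        for my $i (0 .. $MAX_VOLUMES_PER_GUEST - 1) {
            if (!$disk_ids->{$i}) {
                $name = "$prefix-$vmid-disk-$i$suffix";
                last;
            }
        }
        die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
            if !defined($name);

        # 3. write back the still-valid reservations including the new name
        push @$kept, "$storeid $vmid $name $now";
        PVE::Tools::file_set_contents($reservation_file, join("\n", @$kept) . "\n");

        return $name;
    };

    my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
    die $@ if $@;
    return $new_disk_name;

That way, there is only one file to manage and no PIDs to worry about.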
> The patch introduces a tmp file which stores the newly assigned disk
> names and the pid of the process which requested the disk name. If a
> second process is assigned the same name, it will see from the file that
> the name has already been assigned to another process, and will take the
> next available name. Reading and writing to the tmp file requires a lock
> to prevent races.
>
> [0] https://lists.proxmox.com/pipermail/pve-devel/2024-January/061526.html
>
> Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
> ---
> src/PVE/Storage/Plugin.pm | 99 ++++++++++++++++++++++++++++++++++-----
> 1 file changed, 86 insertions(+), 13 deletions(-)
>
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 7456c8e..f76550a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -10,8 +10,10 @@ use File::chdir;
> use File::Path;
> use File::Basename;
> use File::stat qw();
> +use File::Copy;
>
So this is used for the single move() below? Any reason not to use
file_set_contents() like in other places? You are using locking after
all, so you would not need the tmp file either.
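For reference - if I remember correctly, file_set_contents() already
writes to a temporary file internally and then renames it into place,
so something like this (sketch, with $new_content being the updated
reservation list) should be enough:

    # atomic replace, no manual tmp file + move() needed
    PVE::Tools::file_set_contents($reserved_names_file, $new_content);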
> use PVE::Tools qw(run_command);
> +use PVE::ProcFSTools;
> use PVE::JSONSchema qw(get_standard_option register_standard_option);
> use PVE::Cluster qw(cfs_register_file);
>
> @@ -779,23 +781,94 @@ my $get_vm_disk_number = sub {
> sub get_next_vm_diskname {
> my ($disk_list, $storeid, $vmid, $fmt, $scfg, $add_fmt_suffix) = @_;
>
> - $fmt //= '';
> - my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> - my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $code = sub {
> + my $reserved_names_file = "/var/tmp/pve-reserved-volnames";
> + my $tmp_file = "/var/tmp/pve-reserved-volnames.tmp";
>
> - my $disk_ids = {};
> - foreach my $disk (@$disk_list) {
> - my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> - $disk_ids->{$disknum} = 1 if defined($disknum);
> - }
> + $fmt //= '';
> + my $prefix = ($fmt eq 'subvol') ? 'subvol' : 'vm';
> + my $suffix = $add_fmt_suffix ? ".$fmt" : '';
> + my $disk_id;
> + my $disk_ids = {};
>
> - for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> - if (!$disk_ids->{$i}) {
> - return "$prefix-$vmid-disk-$i$suffix";
> + foreach my $disk (@$disk_list) {
> + my $disknum = $get_vm_disk_number->($disk, $scfg, $vmid, $suffix);
> + $disk_ids->{$disknum} = 1 if defined($disknum);
> }
> - }
>
> - die "unable to allocate an image name for VM $vmid in storage '$storeid'\n"
> + for (my $i = 0; $i < $MAX_VOLUMES_PER_GUEST; $i++) {
> + if (!$disk_ids->{$i}) {
> + $disk_id = $i;
> + last;
> + }
> + }
> +
> + if (! -e $reserved_names_file) {
> + my $create_h = IO::File->new($reserved_names_file, "w") ||
> + die "can't open or create'$reserved_names_file' - $!\n";
> + print $create_h "$storeid $vmid $disk_id $$";
Not sure about using PIDs here. PIDs are re-used, and couldn't this also
be the PID of some long-running daemon? Maybe we can use an expire time
like the next_unused_port() helper in PVE::Tools does? It can also be a
few minutes in this case, as we don't expect reservations to saturate.
[...]
> + my $new_disk_name = PVE::Tools::lock_file('/var/lock/pve-disk-names.lck', 10, $code);
The lock file is local to the node, so this won't work for shared
storages. For those, you would need to track the reservations in a file
on the cluster file system.
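A rough sketch of the direction (cluster_lock_storage() already exists
in Plugin.pm and takes a cluster-wide lock for shared storages; the
pmxcfs path below is purely hypothetical):

    # sketch: dispatch on whether the storage is shared - cluster_lock_storage()
    # in Plugin.pm already does exactly this kind of switch for the locking
    my $reservation_file = $scfg->{shared}
        ? "/etc/pve/priv/reserved-volnames"   # hypothetical path on pmxcfs
        : "/var/tmp/pve-reserved-volnames";   # node-local, as in the patch

    my $new_disk_name = __PACKAGE__->cluster_lock_storage(
        $storeid, $scfg->{shared}, 10, $code);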
> + die $@ if $@;
> +
> + if (!$new_disk_name) {
> + die "unable to allocate an image name for VM $vmid in storage '$storeid'\n";
> + }
> + return $new_disk_name;
> }
>
> sub find_free_diskname {