[pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot

public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed

* [pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot
@ 2024-03-07  9:33 Dominik Csapak
  2024-03-07 13:31 ` Mira Limbeck
  2024-03-08 13:18 ` [pve-devel] applied: " Thomas Lamprecht
  0 siblings, 2 replies; 3+ messages in thread
From: Dominik Csapak @ 2024-03-07  9:33 UTC (permalink / raw)
  To: pve-devel

when triggering a vm reboot from the host (via cli/api), the reboot code
is called under a guest lock, which creates a reboot request, shuts down
the vm and calls the regular cleanup code (which includes the mdev
cleanup).

in parallel the qmeventd sees the vanished vm, and starts 'qm cleanup'
which is (among other tasks) used to restart a vm with a pending reboot
request. It does this also under a guest lock, with a default timeout of
10 seconds.

Since we currently wait 10 seconds for the nvidia driver to clean the
mdev, this creates a race condition for the cleanup lock. Iow. when the
call to `qm cleanup` starts before we started to sleep for 10 seconds,
it will not be able to acquire its lock and not start the vm again.

To fix it, do two things:
* increase the timeout in `qm cleanup` to 60 seconds
  (technically this still might run into a timeout, as we can configure
  up to 16 mediated devices with up to 10 seconds sleep each, but
  realistically most users won't configure more than two or three of
  them, if even that)

* change the `sleep 10` to a loop sleeping for 1 second each before
  checking the state again. This shortens the timeout when the driver
  can clean it up in the meantime.

Further, add a bit of logging, so we can properly see in the (task) log
what is happening when.

Fixes: 49c51a60 (pci: workaround nvidia driver issue on mdev cleanup)
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 PVE/CLI/qm.pm     |  3 ++-
 PVE/QemuServer.pm | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/PVE/CLI/qm.pm b/PVE/CLI/qm.pm
index b17b4fe2..dce6c7a1 100755
--- a/PVE/CLI/qm.pm
+++ b/PVE/CLI/qm.pm
@@ -915,7 +915,8 @@ __PACKAGE__->register_method({
 	my $storecfg = PVE::Storage::config();
 	warn "Starting cleanup for $vmid\n";
 
-	PVE::QemuConfig->lock_config($vmid, sub {
+	# mdev cleanup can take a while, so wait up to 60 seconds
+	PVE::QemuConfig->lock_config_full($vmid, 60, sub {
 	    my $conf = PVE::QemuConfig->load_config ($vmid);
 	    my $pid = PVE::QemuServer::check_running ($vmid);
 	    die "vm still running\n" if $pid;
diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index b45507aa..efe93c62 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -6133,12 +6133,20 @@ sub cleanup_pci_devices {
 	    my $dev_sysfs_dir = "/sys/bus/mdev/devices/$uuid";
 
 	    # some nvidia vgpu driver versions want to clean the mdevs up themselves, and error
-	    # out when we do it first. so wait for 10 seconds and then try it
-	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/) {
-		sleep 10;
+	    # out when we do it first. so wait for up to 10 seconds and then try it manually
+	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/ && -e $dev_sysfs_dir) {
+		my $count = 0;
+		while (-e $dev_sysfs_dir && $count < 10) {
+		    sleep 1;
+		    $count++;
+		}
+		warn "waited $count seconds for mdev cleanup\n";
 	    }
 
-	    PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1") if -e $dev_sysfs_dir;
+	    if (-e $dev_sysfs_dir) {
+		warn "cleaning up mediated device $uuid\n";
+		PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1");
+	    }
 	}
     }
     PVE::QemuServer::PCI::remove_pci_reservation($vmid);
-- 
2.39.2





^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot
  2024-03-07  9:33 [pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot Dominik Csapak
@ 2024-03-07 13:31 ` Mira Limbeck
  2024-03-08 13:18 ` [pve-devel] applied: " Thomas Lamprecht
  1 sibling, 0 replies; 3+ messages in thread
From: Mira Limbeck @ 2024-03-07 13:31 UTC (permalink / raw)
  To: pve-devel

On 3/7/24 10:33, Dominik Csapak wrote:
> when triggering a vm reboot from the host (via cli/api), the reboot code
> is called under a guest lock, which creates a reboot request, shuts down
> the vm and calls the regular cleanup code (which includes the mdev
> cleanup).
> 
> in parallel the qmeventd sees the vanished vm, and starts 'qm cleanup'
> which is (among other tasks) used to restart a vm with a pending reboot
> request. It does this also under a guest lock, with a default timeout of
> 10 seconds.
> 
> Since we currently wait 10 seconds for the nvidia driver to clean the
> mdev, this creates a race condition for the cleanup lock. Iow. when the
> call to `qm cleanup` starts before we started to sleep for 10 seconds,
> it will not be able to acquire its lock and not start the vm again.
> 
> To fix it, do two things:
> * increase the timeout in `qm cleanup` to 60 seconds
>   (technically this still might run into a timeout, as we can configure
>   up to 16 mediated devices with up to 10 seconds sleep each, but
>   realistically most users won't configure more than two or three of
>   them, if even that)
> 
> * change the `sleep 10` to a loop sleeping for 1 second each before
>   checking the state again. This shortens the timeout when the driver
>   can clean it up in the meantime.
> 
> Further, add a bit of logging, so we can properly see in the (task) log
> what is happening when.
> 
> Fixes: 49c51a60 (pci: workaround nvidia driver issue on mdev cleanup)
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
>  PVE/CLI/qm.pm     |  3 ++-
>  PVE/QemuServer.pm | 16 ++++++++++++----
>  2 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/PVE/CLI/qm.pm b/PVE/CLI/qm.pm
> index b17b4fe2..dce6c7a1 100755
> --- a/PVE/CLI/qm.pm
> +++ b/PVE/CLI/qm.pm
> @@ -915,7 +915,8 @@ __PACKAGE__->register_method({
>  	my $storecfg = PVE::Storage::config();
>  	warn "Starting cleanup for $vmid\n";
>  
> -	PVE::QemuConfig->lock_config($vmid, sub {
> +	# mdev cleanup can take a while, so wait up to 60 seconds
> +	PVE::QemuConfig->lock_config_full($vmid, 60, sub {
>  	    my $conf = PVE::QemuConfig->load_config ($vmid);
>  	    my $pid = PVE::QemuServer::check_running ($vmid);
>  	    die "vm still running\n" if $pid;
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index b45507aa..efe93c62 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -6133,12 +6133,20 @@ sub cleanup_pci_devices {
>  	    my $dev_sysfs_dir = "/sys/bus/mdev/devices/$uuid";
>  
>  	    # some nvidia vgpu driver versions want to clean the mdevs up themselves, and error
> -	    # out when we do it first. so wait for 10 seconds and then try it
> -	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/) {
> -		sleep 10;
> +	    # out when we do it first. so wait for up to 10 seconds and then try it manually
> +	    if ($d->{ids}->[0]->[0]->{vendor} =~ m/^(0x)?10de$/ && -e $dev_sysfs_dir) {
> +		my $count = 0;
> +		while (-e $dev_sysfs_dir && $count < 10) {
> +		    sleep 1;
> +		    $count++;
> +		}
> +		warn "waited $count seconds for mdev cleanup\n";
>  	    }
>  
> -	    PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1") if -e $dev_sysfs_dir;
> +	    if (-e $dev_sysfs_dir) {
> +		warn "cleaning up mediated device $uuid\n";
> +		PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1");
> +	    }
>  	}
>      }
>      PVE::QemuServer::PCI::remove_pci_reservation($vmid);

lgtm

Reviewed-by: Mira Limbeck <m.limbeck@proxmox.com>


Sadly I can't test it since I don't have a vGPU-capable nvidia card. But
the customer who reported it has tested the patch already and it seemed
to fix the issue without introducing other issues so far.




^ permalink raw reply	[flat|nested] 3+ messages in thread

* [pve-devel] applied: [PATCH qemu-server] mediated devices: fix race condition in vm reboot
  2024-03-07  9:33 [pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot Dominik Csapak
  2024-03-07 13:31 ` Mira Limbeck
@ 2024-03-08 13:18 ` Thomas Lamprecht
  1 sibling, 0 replies; 3+ messages in thread
From: Thomas Lamprecht @ 2024-03-08 13:18 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak

Am 07/03/2024 um 10:33 schrieb Dominik Csapak:
> when triggering a vm reboot from the host (via cli/api), the reboot code
> is called under a guest lock, which creates a reboot request, shuts down
> the vm and calls the regular cleanup code (which includes the mdev
> cleanup).
> 
> in parallel the qmeventd sees the vanished vm, and starts 'qm cleanup'
> which is (among other tasks) used to restart a vm with a pending reboot
> request. It does this also under a guest lock, with a default timeout of
> 10 seconds.
> 
> Since we currently wait 10 seconds for the nvidia driver to clean the
> mdev, this creates a race condition for the cleanup lock. Iow. when the
> call to `qm cleanup` starts before we started to sleep for 10 seconds,
> it will not be able to acquire its lock and not start the vm again.
> 
> To fix it, do two things:
> * increase the timeout in `qm cleanup` to 60 seconds
>   (technically this still might run into a timeout, as we can configure
>   up to 16 mediated devices with up to 10 seconds sleep each, but
>   realistically most users won't configure more than two or three of
>   them, if even that)
> 
> * change the `sleep 10` to a loop sleeping for 1 second each before
>   checking the state again. This shortens the timeout when the driver
>   can clean it up in the meantime.
> 
> Further, add a bit of logging, so we can properly see in the (task) log
> what is happening when.
> 
> Fixes: 49c51a60 (pci: workaround nvidia driver issue on mdev cleanup)
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
>  PVE/CLI/qm.pm     |  3 ++-
>  PVE/QemuServer.pm | 16 ++++++++++++----
>  2 files changed, 14 insertions(+), 5 deletions(-)
> 
>

applied with Mira's R-b, thanks!

Albeit I amended the commit to reword the message slightly and to switch
from using warn to print for such informational messages.
As both stdout and stderr shows up in task log that should not matter,
some code might wire up $SIG{__WARN__} though, causing potentially
confusing syslog noise or even failures.




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-03-08 13:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-07  9:33 [pve-devel] [PATCH qemu-server] mediated devices: fix race condition in vm reboot Dominik Csapak
2024-03-07 13:31 ` Mira Limbeck
2024-03-08 13:18 ` [pve-devel] applied: " Thomas Lamprecht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal