public inbox for pve-devel@lists.proxmox.com
From: Wolfgang Bumiller <w.bumiller@proxmox.com>
To: Dominik Csapak <d.csapak@proxmox.com>
Cc: pve-devel@lists.proxmox.com
Subject: [pve-devel] applied: [PATCH qemu-server] pci: workaround nvidia driver issue on mdev cleanup
Date: Thu, 16 Mar 2023 09:23:54 +0100	[thread overview]
Message-ID: <20230316082354.co5lx6myj4his7wk@casey.proxmox.com> (raw)
In-Reply-To: <20230224130431.1174277-1-d.csapak@proxmox.com>

applied, though I'm not too happy about it...

The moment those driver versions that don't clean up by themselves are
"rare" enough, I'd like to just drop the cleanup code from our side!

On Fri, Feb 24, 2023 at 02:04:31PM +0100, Dominik Csapak wrote:
> in some nvidia grid drivers (e.g. 14.4 and 15.x), their kernel module
> tries to clean up the mdev device when the vm is shut down, and if it
> cannot do that (e.g. because we already cleaned it up), their removal
> process aborts with an error such that the vgpu still exists in their
> book-keeping, but can't be used/recreated/freed until a reboot.
> 
> since there seems to be no obvious way to detect if that's the case
> besides either parsing dmesg (which is racy) or the nvidia kernel module
> version (which i'd rather not do), we simply test the pci device vendor
> for nvidia and add a 10s sleep. that should give the driver enough time
> to clean up; we will then not find the path anymore and skip the cleanup.
> 
> This way, it works with both the newer and older versions of the driver
> (some of the older drivers are LTS releases, so they're still
> supported).
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
>  PVE/QemuServer.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 40be44db..096e7f0d 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -6161,6 +6161,15 @@ sub cleanup_pci_devices {
>  	    # NOTE: avoid PVE::SysFSTools::pci_cleanup_mdev_device as it requires PCI ID and we
>  	    # don't want to break ABI just for this two liner
>  	    my $dev_sysfs_dir = "/sys/bus/mdev/devices/$uuid";
> +
> +	    # some nvidia vgpu driver versions want to clean the mdevs up themselves, and error
> +	    # out when we do it first. so wait for 10 seconds and then try it
> +	    my $pciid = $d->{pciid}->[0]->{id};
> +	    my $info = PVE::SysFSTools::pci_device_info("$pciid");
> +	    if ($info->{vendor} eq '10de') {
> +		sleep 10;
> +	    }
> +
>  	    PVE::SysFSTools::file_write("$dev_sysfs_dir/remove", "1") if -e $dev_sysfs_dir;
>  	}
>      }
> -- 
> 2.30.2
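
The detection logic in the patch boils down to reading the device's sysfs vendor file and comparing it against NVIDIA's PCI vendor ID, 0x10de (PVE::SysFSTools::pci_device_info strips the "0x" prefix, hence the bare '10de' comparison above). A minimal shell sketch of that check, using a mock sysfs tree since the real path depends on the host's hardware; the device address below is made up for illustration:

```shell
# Sketch of the vendor check, against a mock sysfs tree.
# On a real host the file would be /sys/bus/pci/devices/<addr>/vendor.
set -eu

mock=$(mktemp -d)
dev="$mock/0000:01:00.0"           # hypothetical PCI address
mkdir -p "$dev"
printf '0x10de\n' > "$dev/vendor"  # pretend this is an NVIDIA GPU

vendor=$(cat "$dev/vendor")
if [ "$vendor" = "0x10de" ]; then
    # the patch sleeps here to let the nvidia driver clean up first
    echo "nvidia device: delay mdev removal"
else
    echo "non-nvidia device: remove mdev immediately"
fi
```

The raw sysfs file carries the "0x" prefix; the 10-second sleep in the patch simply yields the race to the driver, so by the time our cleanup runs, the mdev path is usually already gone and the `-e $dev_sysfs_dir` guard skips the write.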


Thread overview: 2+ messages
2023-02-24 13:04 [pve-devel] " Dominik Csapak
2023-03-16  8:23 ` Wolfgang Bumiller [this message]
