public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes
@ 2024-08-06 12:21 Dominik Csapak
  2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia syfsapi as mdev Dominik Csapak
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:21 UTC (permalink / raw)
  To: pve-devel

For many new cards, nvidia changed the kernel interface since kernel
verion 6.8. Instead of using mediated devices, they provide their own
api.

This series adapts to that, with no required change to the vm config,
and only minimal changes to our api.

The biggest change is that the mdev types can now be queried on
/nodes/NODE/hardware/pci/<pciid-or-mapping/mdev either via a pci id
(like it was before) or via the name of a pci mapping (now checks all
local devices from that mapping)

A thing to improve could be to parse the available vgpu types from
nvidia-smi instead of the sysfs, since that not always contains all
types (see the common patch 1/2 for details)

We could abstract the code that deals with different types probably a
bit more, but for me it seems Ok for now, and finding a good API for
that is hard with only 3 modes that are very different from each other
(raw/mdev/nvidia).

qemu-server patches depend on the common patches, but the manager patch
does not rely on any other in this series. It is required though
for the user to be able to select types (in certain conditions).

note that this series requires my previous patch to the sysfstools to
improve write reliability[0], otherwise the cleanup or creation may
fail.

0: https://lists.proxmox.com/pipermail/pve-devel/2024-July/064814.html

pve-common:

Dominik Csapak (2):
  SysFSTools: handle new nvidia syfsapi as mdev
  SysFSTools: lscpi: move mdev and iommugroup check outside of verbose

 src/PVE/SysFSTools.pm | 83 ++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 32 deletions(-)

qemu-server:

Dominik Csapak (3):
  pci: choose devices: don't reserve pciids when vm is already running
  pci: remove pci reservation: optionally give list of ids to remove
  pci: mdev: adapt to nvidia interface with kernel >= 6.8

 PVE/QemuServer.pm                | 30 +++++++++--
 PVE/QemuServer/PCI.pm            | 92 +++++++++++++++++++++++++++++---
 test/run_config2command_tests.pl |  8 ++-
 3 files changed, 118 insertions(+), 12 deletions(-)

pve-manager:

Dominik Csapak (1):
  api/ui: improve mdev listing for pci mappings

 PVE/API2/Hardware/PCI.pm     | 45 +++++++++++++++++++++++++++++-------
 www/manager6/qemu/PCIEdit.js | 12 +---------
 2 files changed, 38 insertions(+), 19 deletions(-)

-- 
2.39.2


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia syfsapi as mdev
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
@ 2024-08-06 12:21 ` Dominik Csapak
  2024-08-06 12:21 ` [pve-devel] [PATCH common 2/2] SysFSTools: lscpi: move mdev and iommugroup check outside of verbose Dominik Csapak
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:21 UTC (permalink / raw)
  To: pve-devel

with kernel 6.8 NVIDIAs vGPU driver has a different api than the
previous 'mediated devices'. Adapt our sysfcode to also recognize this
for the 'mdev' paths and add another 'nvidia' property so we can detect
this.

Also parse the new api when they exist instead of the mediated devices.

The biggest difference to the existing mdev api for our use is that the
devices don't report all generally available devices, only the
createable ones. So if a user wants to configure a VM, the selection is
restricted by what may currently run on the GPU (depending ont the exact
settings, e.g. mixed mode gpus where different models can be mixed on a
single GPU; not the default though)

We could overcome this, when we'd parse the general info from the
'nvidia-smi' tool, though I'm currently unsure if that interface is
stable and intended to be parsed (there is no json output or similar
AFAIK)

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
best viewed with '-w'
 src/PVE/SysFSTools.pm | 65 ++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/src/PVE/SysFSTools.pm b/src/PVE/SysFSTools.pm
index 8eb9f2e..25b2f31 100644
--- a/src/PVE/SysFSTools.pm
+++ b/src/PVE/SysFSTools.pm
@@ -112,6 +112,10 @@ sub lspci {
 
 	    if (-d "$devdir/mdev_supported_types") {
 		$res->{mdev} = 1;
+	    } elsif (-d "$devdir/nvidia") {
+		# nvidia driver for kernel 6.8 or higher
+		$res->{mdev} = 1; # for api compatibility
+		$res->{nvidia} = 1;
 	    }
 
 	    my $device_hash = $ids->{$vendor}->{devices}->{$device} // {};
@@ -159,31 +163,46 @@ sub get_mdev_types {
 
     my $types = [];
 
-    my $mdev_path = "$pcisysfs/devices/$id/mdev_supported_types";
-    if (!-d $mdev_path) {
-	return $types;
+    my $dev_path = "$pcisysfs/devices/$id";
+    my $mdev_path = "$dev_path/mdev_supported_types";
+    my $nvidia_path = "$dev_path/nvidia/creatable_vgpu_types";
+    if (-d $mdev_path) {
+	dir_glob_foreach($mdev_path, '[^\.].*', sub {
+	    my ($type) = @_;
+
+	    my $type_path = "$mdev_path/$type";
+
+	    my $available = int(file_read_firstline("$type_path/available_instances"));
+	    my $description = PVE::Tools::file_get_contents("$type_path/description");
+
+	    my $entry = {
+		type => $type,
+		description => $description,
+		available => $available,
+	    };
+
+	    my $name = file_read_firstline("$type_path/name");
+	    $entry->{name} = $name if defined($name);
+
+	    push @$types, $entry;
+	});
+    } elsif (-f $nvidia_path) {
+	my $creatable = PVE::Tools::file_get_contents($nvidia_path);
+	for my $line (split("\n", $creatable)) {
+	    next if $line =~ m/^ID/; # header
+	    next if $line !~ m/^(.*?)\s*:\s*(.*)$/;
+	    my $id = $1;
+	    my $name = $2;
+
+	    push $types->@*, {
+		type => "nvidia-$id", # backwards compatibility
+		description => "", # TODO, read from xml/nvidia-smi ?
+		available => 1,
+		name  => $name,
+	    }
+	}
     }
 
-    dir_glob_foreach($mdev_path, '[^\.].*', sub {
-	my ($type) = @_;
-
-	my $type_path = "$mdev_path/$type";
-
-	my $available = int(file_read_firstline("$type_path/available_instances"));
-	my $description = PVE::Tools::file_get_contents("$type_path/description");
-
-	my $entry = {
-	    type => $type,
-	    description => $description,
-	    available => $available,
-	};
-
-	my $name = file_read_firstline("$type_path/name");
-	$entry->{name} = $name if defined($name);
-
-	push @$types, $entry;
-    });
-
     return $types;
 }
 
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH common 2/2] SysFSTools: lscpi: move mdev and iommugroup check outside of verbose
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
  2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia syfsapi as mdev Dominik Csapak
@ 2024-08-06 12:21 ` Dominik Csapak
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running Dominik Csapak
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:21 UTC (permalink / raw)
  To: pve-devel

they should not be expensive (only reading/file checking in sysfs; the
parsed vendor/id names are not required) so we should include them
always.

We need at least the mdev part later at a point where we're not
interested in the rest of the verbose mode.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
best viewed with -w
 src/PVE/SysFSTools.pm | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/PVE/SysFSTools.pm b/src/PVE/SysFSTools.pm
index 25b2f31..0bde6d7 100644
--- a/src/PVE/SysFSTools.pm
+++ b/src/PVE/SysFSTools.pm
@@ -103,21 +103,21 @@ sub lspci {
 	    return;
 	}
 
-	if ($verbose) {
-	    $res->{iommugroup} = -1;
-	    if (-e "$devdir/iommu_group") {
-		my ($iommugroup) = (readlink("$devdir/iommu_group") =~ m/\/(\d+)$/);
-		$res->{iommugroup} = int($iommugroup);
-	    }
+	$res->{iommugroup} = -1;
+	if (-e "$devdir/iommu_group") {
+	    my ($iommugroup) = (readlink("$devdir/iommu_group") =~ m/\/(\d+)$/);
+	    $res->{iommugroup} = int($iommugroup);
+	}
 
-	    if (-d "$devdir/mdev_supported_types") {
-		$res->{mdev} = 1;
-	    } elsif (-d "$devdir/nvidia") {
-		# nvidia driver for kernel 6.8 or higher
-		$res->{mdev} = 1; # for api compatibility
-		$res->{nvidia} = 1;
-	    }
+	if (-d "$devdir/mdev_supported_types") {
+	    $res->{mdev} = 1;
+	} elsif (-d "$devdir/nvidia") {
+	    # nvidia driver for kernel 6.8 or higher
+	    $res->{mdev} = 1; # for api compatibility
+	    $res->{nvidia} = 1;
+	}
 
+	if ($verbose) {
 	    my $device_hash = $ids->{$vendor}->{devices}->{$device} // {};
 
 	    my $sub_vendor = file_read_firstline("$devdir/subsystem_vendor");
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
  2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia syfsapi as mdev Dominik Csapak
  2024-08-06 12:21 ` [pve-devel] [PATCH common 2/2] SysFSTools: lscpi: move mdev and iommugroup check outside of verbose Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove Dominik Csapak
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
  To: pve-devel

since the only way this could happen is when we're being called
from 'qm showcmd' and there we don't want to reserve or create anything.

In case the vm was not running, we actually reserve the devices, so we
want to call 'cleanup_pci_devices' after to remove those again. This
minimizes the timespan where those devices are not available for real vm
starts.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
alternatively we could pass the info through that we're in 'showcmd',
but since that's a relatively long call chain and the involved functions
already have a lot of parameters, i opted for this. I'm not opposed to
another solution, if wanted, though.

 PVE/QemuServer.pm                | 5 +++++
 PVE/QemuServer/PCI.pm            | 9 +++++++--
 test/run_config2command_tests.pl | 5 ++++-
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index b26da505..b2cbe00e 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -6129,6 +6129,11 @@ sub vm_commandline {
     my $defaults = load_defaults();
 
     my $cmd = config_to_command($storecfg, $vmid, $conf, $defaults, $forcemachine, $forcecpu);
+    # if the vm is not running, we need to clean up the reserved/created devices
+    if (!PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
+	eval { cleanup_pci_devices($vmid, $conf) };
+	warn $@ if $@;
+    }
 
     return PVE::Tools::cmd2string($cmd);
 }
diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index 1673041b..97eb2165 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -523,6 +523,9 @@ sub parse_hostpci_devices {
 my sub choose_hostpci_devices {
     my ($devices, $vmid) = @_;
 
+    # if the vm is running, we must be in 'showcmd', so don't actually reserve or create anything
+    my $is_running = PVE::QemuServer::Helpers::vm_running_locally($vmid) ? 1 : 0;
+
     my $used = {};
 
     my $add_used_device = sub {
@@ -555,8 +558,10 @@ my sub choose_hostpci_devices {
 	    my $ids = [map { $_->{id} } @$alternative];
 
 	    next if grep { defined($used->{$_}) } @$ids; # already used
-	    eval { reserve_pci_usage($ids, $vmid, 10, undef) };
-	    next if $@;
+	    if (!$is_running) {
+		eval { reserve_pci_usage($ids, $vmid, 10, undef) };
+		next if $@;
+	    }
 
 	    # found one that is not used or reserved
 	    $add_used_device->($alternative);
diff --git a/test/run_config2command_tests.pl b/test/run_config2command_tests.pl
index d48ef562..9b5e87ff 100755
--- a/test/run_config2command_tests.pl
+++ b/test/run_config2command_tests.pl
@@ -196,7 +196,10 @@ $qemu_server_module->mock(
     },
     get_initiator_name => sub {
 	return 'iqn.1993-08.org.debian:01:aabbccddeeff';
-    }
+    },
+    cleanup_pci_devices => {
+	# do nothing
+    },
 );
 
 my $qemu_server_config;
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
                   ` (2 preceding siblings ...)
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8 Dominik Csapak
  2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
  To: pve-devel

so we can partially release ids again. This will be necessary for
NVIDIAs new sysfs api

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 PVE/QemuServer/PCI.pm | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index 97eb2165..ae180e08 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -719,11 +719,12 @@ my $write_pci_reservation_unlocked = sub {
 
 # removes all PCI device reservations held by the `vmid`
 sub remove_pci_reservation {
-    my ($vmid) = @_;
+    my ($vmid, $pciids) = @_;
 
     PVE::Tools::lock_file($PCIID_RESERVATION_LOCK, 2, sub {
 	my $reservation_list = $parse_pci_reservation_unlocked->();
 	for my $id (keys %$reservation_list) {
+	    next if defined($pciids) && !grep { $_ eq $id } $pciids->@*;
 	    my $reservation = $reservation_list->{$id};
 	    next if $reservation->{vmid} != $vmid;
 	    delete $reservation_list->{$id};
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
                   ` (3 preceding siblings ...)
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
  2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
  To: pve-devel

since kernel 6.8, NVIDIAs vGPU driver does not use the generic mdev
interface anymore, since they relied on a feature there which is not
available anymore. IIUC the kernel [0] recommends drivers to implement
their own device specific features since putting all in the generic one
does not make sense.

They now have an 'nvidia' folder in the device sysfs path, which
contains the files `creatable_vgpu_types`/`current_vgpu_type` to
control the virtual functions model, and then the whole virtual function
has to be passed through (although without resetting and changing to the
vfio-pci driver).

This patch implements changes so that from a config perspective, it
still is an mediated device, and we map the functionality iff the device
has no mediated devices but the new NVIDIAs sysfsapi and the model name
is 'nvidia-<..>'

It behaves a bit different than mdevs and normal pci passthrough, as we
have to choose the correct device immediately since it's bound to the
pciid, but we must not bind the device to vfio-pci as the NVIDIA driver
implements this functionality itself.

When cleaning up, we iterate over all reserved devices (since for a
mapping we can't know at this point which was chosen besides looking at
the reservations) and reset the vgpu model to '0', so it frees up the
reservation from NVIDIAs side. (We also do that in a loop, since it's
not always immediately ready after QEMU closes)

A general problem (but that was previously also the case) is that a
showcmd (for a not running guest) reserves the pciids, which might block
an execution of a different real vm. This is now a bit more problematic
as we (temporarily) set the vgpu type then.

0: https://docs.kernel.org/driver-api/vfio-pci-device-specific-driver-acceptance.html

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 PVE/QemuServer.pm                | 25 ++++++++--
 PVE/QemuServer/PCI.pm            | 80 ++++++++++++++++++++++++++++++--
 test/run_config2command_tests.pl |  3 ++
 3 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index b2cbe00e..981e8fa7 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -5833,8 +5833,8 @@ sub vm_start_nolock {
 
 	    my $chosen_mdev;
 	    for my $dev ($d->{ids}->@*) {
-		my $info = eval { PVE::QemuServer::PCI::prepare_pci_device($vmid, $dev->{id}, $index, $d->{mdev}) };
-		if ($d->{mdev}) {
+		my $info = eval { PVE::QemuServer::PCI::prepare_pci_device($vmid, $dev->{id}, $index, $d) };
+		if ($d->{mdev} || $d->{nvidia}) {
 		    warn $@ if $@;
 		    $chosen_mdev = $info;
 		    last if $chosen_mdev; # if successful, we're done
@@ -5843,7 +5843,7 @@ sub vm_start_nolock {
 		}
 	    }
 
-	    next if !$d->{mdev};
+	    next if !$d->{mdev} && !$d->{nvidia};
 	    die "could not create mediated device\n" if !defined($chosen_mdev);
 
 	    # nvidia grid needs the uuid of the mdev as qemu parameter
@@ -6175,6 +6175,25 @@ sub cleanup_pci_devices {
     # templates don't use pci devices
     return if $conf->{template};
 
+    my $reservations = PVE::QemuServer::PCI::get_reservations($vmid);
+    # clean up nvidia devices
+    for my $id ($reservations->@*) {
+	$id = '0000:'.$id if $id !~ m/^0000:/;
+
+	my $create_path = "/sys/bus/pci/devices/$id/nvidia/current_vgpu_type";
+
+	next if ! -f $create_path;
+
+	for (my $i = 0; $i < 10; $i++) {
+	    last if file_read_firstline($create_path) eq "0";
+	    sleep 1;
+	    PVE::SysFSTools::file_write($create_path, "0");
+	}
+	if (file_read_firstline($create_path) ne "0") {
+	    warn "could not cleanup nvidia vgpu for '$id'\n";
+	}
+    }
+
     foreach my $key (keys %$conf) {
 	next if $key !~ m/^hostpci(\d+)$/;
 	my $hostpciindex = $1;
diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index ae180e08..1ea043bb 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -453,7 +453,12 @@ sub parse_hostpci {
 	}
 	if (scalar($ids->@*) > 1) {
 	    $res->{'has-multifunction'} = 1;
-	    die "cannot use mediated device with multifunction device\n" if $res->{mdev};
+	    die "cannot use mediated device with multifunction device\n" if $res->{mdev} || $res->{nvidia};
+	} elsif ($res->{mdev}) {
+	    if ($ids->[0]->{nvidia} && $res->{mdev} =~ m/^nvidia-(\d+)$/) {
+		$res->{nvidia} = $1;
+		delete $res->{mdev};
+	    }
 	}
 	push $res->{ids}->@*, $ids;
     }
@@ -497,7 +502,7 @@ sub parse_hostpci_devices {
 	    die "legacy IGD assignment is not compatible with x-vga\n"
 		if $d->{'x-vga'};
 	    die "legacy IGD assignment is not compatible with mdev\n"
-		if $d->{mdev};
+		if $d->{mdev} || $d->{nvidia};
 	    die "legacy IGD assignment is not compatible with q35\n"
 		if $q35;
 	    die "legacy IGD assignment is not compatible with multifunction devices\n"
@@ -515,6 +520,41 @@ sub parse_hostpci_devices {
     return $parsed_devices;
 }
 
+# set vgpu type of a vf of an nvidia gpu with kernel 6.8 or newer
+my sub create_nvidia_device {
+    my ($id, $model) = @_;
+
+    $id = '0000:'.$id if $id !~ m/^0000:/;
+
+    my $creation = "/sys/bus/pci/devices/$id/nvidia/current_vgpu_type";
+
+    die "no nvidia sysfs api for '$id'\n" if ! -f $creation;
+
+    my $current = PVE::Tools::file_read_firstline($creation);
+    if ($current ne "0") {
+	return 1 if $current eq $model;
+	# reset vgpu type so we can see all available and set the real device
+	die "unable to reset vgpu type for '$id'\n" if !PVE::SysFSTools::file_write($creation, "0");
+    }
+
+    my $types = PVE::SysFSTools::get_mdev_types($id);
+    my $selected;
+    for my $type_definition ($types->@*) {
+	next if $type_definition->{type} ne "nvidia-$model";
+	$selected = $type_definition;
+    }
+
+    if (!defined($selected) || $selected->{available} < 1) {
+	die "vgpu type '$model' not available for '$id'\n";
+    }
+
+    if (!PVE::SysFSTools::file_write($creation, $model)) {
+	die "could not set vgpu type to '$model' for '$id'\n";
+    }
+
+    return 1;
+}
+
 # takes the hash returned by parse_hostpci_devices and for all non mdev gpus,
 # selects one of the given alternatives by trying to reserve it
 #
@@ -541,7 +581,7 @@ my sub choose_hostpci_devices {
 	my $device = $devices->{"hostpci$i"};
 	next if !$device;
 
-	if ($device->{mdev}) {
+	if ($device->{mdev} && !$device->{nvidia}) {
 	    $device->{ids} = [ map { $_->[0] } $device->{ids}->@* ];
 	    next;
 	}
@@ -550,6 +590,10 @@ my sub choose_hostpci_devices {
 	    # we only have one alternative, use that
 	    $device->{ids} = $device->{ids}->[0];
 	    $add_used_device->($device->{ids});
+	    if ($device->{nvidia} && !$is_running) {
+		reserve_pci_usage($device->{ids}->[0]->{id}, $vmid, 10, undef);
+		create_nvidia_device($device->{ids}->[0]->{id}, $device->{nvidia});
+	    }
 	    next;
 	}
 
@@ -563,6 +607,15 @@ my sub choose_hostpci_devices {
 		next if $@;
 	    }
 
+	    if ($device->{nvidia} && !$is_running) {
+		eval { create_nvidia_device($ids->[0], $device->{nvidia}) };
+		if (my $err = $@) {
+		    warn $err;
+		    remove_pci_reservation($vmid, $ids);
+		    next;
+		}
+	    }
+
 	    # found one that is not used or reserved
 	    $add_used_device->($alternative);
 	    $device->{ids} = $alternative;
@@ -661,13 +714,15 @@ sub print_hostpci_devices {
 }
 
 sub prepare_pci_device {
-    my ($vmid, $pciid, $index, $mdev) = @_;
+    my ($vmid, $pciid, $index, $device) = @_;
 
     my $info = PVE::SysFSTools::pci_device_info("$pciid");
     die "cannot prepare PCI pass-through, IOMMU not present\n" if !PVE::SysFSTools::check_iommu_support();
     die "no pci device info for device '$pciid'\n" if !$info;
 
-    if ($mdev) {
+    if ($device->{nvidia}) {
+	# nothing to do
+    } elsif (my $mdev = $device->{mdev}) {
 	my $uuid = generate_mdev_uuid($vmid, $index);
 	PVE::SysFSTools::pci_create_mdev_device($pciid, $uuid, $mdev);
     } else {
@@ -734,6 +789,21 @@ sub remove_pci_reservation {
     die $@ if $@;
 }
 
+# return all currently reserved ids from the given vmid
+sub get_reservations {
+    my ($vmid) = @_;
+
+    my $reservations = $parse_pci_reservation_unlocked->();
+
+    my $list = [];
+
+    for my $pci_id (sort keys $reservations->%*) {
+	push $list->@*, $pci_id if $reservations->{$pci_id}->{vmid} == $vmid;
+    }
+
+    return $list;
+}
+
 sub reserve_pci_usage {
     my ($requested_ids, $vmid, $timeout, $pid) = @_;
 
diff --git a/test/run_config2command_tests.pl b/test/run_config2command_tests.pl
index 9b5e87ff..8c525f09 100755
--- a/test/run_config2command_tests.pl
+++ b/test/run_config2command_tests.pl
@@ -387,6 +387,9 @@ $pci_module->mock(
 
 	return undef;
     },
+    create_nvidia_device => sub {
+	return 1;
+    }
 );
 
 sub diff($$) {
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings
  2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
                   ` (4 preceding siblings ...)
  2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8 Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
  5 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
  To: pve-devel

currently when we have a pci resource mapping, we manually check only
the available models for the first pci entry. This often works, but not
always, since one could have completely different devices in one
mapping, or with the new nvidia sysfs api we don't get the generally
available models.

To improve this, extend the 'pciid' regex to include pciids or mapping
names, and for mappings, iterate over all local pci devices in it and
extract the mdev types.

This also vastly simplifies the ui code, since we only have to give the
mapping to the selector instead of an (arbitrarily selected) pci id.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 PVE/API2/Hardware/PCI.pm     | 45 +++++++++++++++++++++++++++++-------
 www/manager6/qemu/PCIEdit.js | 12 +---------
 2 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/PVE/API2/Hardware/PCI.pm b/PVE/API2/Hardware/PCI.pm
index a3a689bf..7135a605 100644
--- a/PVE/API2/Hardware/PCI.pm
+++ b/PVE/API2/Hardware/PCI.pm
@@ -135,7 +135,7 @@ __PACKAGE__->register_method ({
 
 __PACKAGE__->register_method ({
     name => 'pciindex',
-    path => '{pciid}',
+    path => '{pciid-or-mapping}',
     method => 'GET',
     description => "Index of available pci methods",
     permissions => {
@@ -145,9 +145,9 @@ __PACKAGE__->register_method ({
 	additionalProperties => 0,
 	properties => {
 	    node => get_standard_option('pve-node'),
-	    pciid => {
+	    'pciid-or-mapping' => {
 		type => 'string',
-		pattern => '(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]',
+		pattern => '(?:(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])|([a-zA-Z][a-zA-Z0-9_-]+)',
 	    },
 	},
     },
@@ -171,7 +171,7 @@ __PACKAGE__->register_method ({
 
 __PACKAGE__->register_method ({
     name => 'mdevscan',
-    path => '{pciid}/mdev',
+    path => '{pciid-or-mapping}/mdev',
     method => 'GET',
     description => "List mediated device types for given PCI device.",
     protected => 1,
@@ -183,10 +183,10 @@ __PACKAGE__->register_method ({
 	additionalProperties => 0,
 	properties => {
 	    node => get_standard_option('pve-node'),
-	    pciid => {
+	    'pciid-or-mapping' => {
 		type => 'string',
-		pattern => '(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]',
-		description => "The PCI ID to list the mdev types for."
+		pattern => '(?:(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])|([a-zA-Z][a-zA-Z0-9_-]+)',
+		description => "The PCI ID or mapping to list the mdev types for."
 	    },
 	},
     },
@@ -206,6 +206,12 @@ __PACKAGE__->register_method ({
 		},
 		description => {
 		    type => 'string',
+		    description => "Additional description of the type."
+		},
+		name => {
+		    type => 'string',
+		    optional => 1,
+		    description => 'A human readable name for the type.',
 		},
 	    },
 	},
@@ -213,5 +219,28 @@ __PACKAGE__->register_method ({
     code => sub {
 	my ($param) = @_;
 
-	return PVE::SysFSTools::get_mdev_types($param->{pciid});
+	if ($param->{'pciid-or-mapping'} =~ m/^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]$/) {
+	    return PVE::SysFSTools::get_mdev_types($param->{'pciid-or-mapping'});
+	} else {
+	    my $mapping = $param->{'pciid-or-mapping'};
+
+	    my $types = {};
+	    my $devices = PVE::Mapping::PCI::find_on_current_node($mapping);
+	    for my $device ($devices->@*) {
+		my $id = $device->{path};
+		next if $id =~ m/;/; # mdev not supported for multifunction devices
+
+		my $device_types = PVE::SysFSTools::get_mdev_types($id);
+
+		for my $type_definition ($device_types->@*) {
+		    my $type = $type_definition->{type};
+		    if (!defined($types->{$type})) {
+			$types->{$type} = $type_definition;
+		    }
+		}
+	    }
+
+	    return [sort { $a->{type} cmp $b->{type} } values($types->%*)];
+	}
+
     }});
diff --git a/www/manager6/qemu/PCIEdit.js b/www/manager6/qemu/PCIEdit.js
index 970e81c2..7c1b7b85 100644
--- a/www/manager6/qemu/PCIEdit.js
+++ b/www/manager6/qemu/PCIEdit.js
@@ -67,17 +67,7 @@ Ext.define('PVE.qemu.PCIInputPanel', {
 
 	    let path = value;
 	    if (pciDev.data.map) {
-		// find local mapping
-		for (const entry of pciDev.data.map) {
-		    let mapping = PVE.Parser.parsePropertyString(entry);
-		    if (mapping.node === pcisel.up('inputpanel').nodename) {
-			path = mapping.path.split(';')[0];
-			break;
-		    }
-		}
-		if (path.indexOf('.') === -1) {
-		    path += '.0';
-		}
+		path = pciDev.data.id;
 	    }
 
 	    if (pciDev.data.mdev) {
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-08-06 12:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia syfsapi as mdev Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 2/2] SysFSTools: lscpi: move mdev and iommugroup check outside of verbose Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8 Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal