* [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia sysfs api as mdev
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
@ 2024-08-06 12:21 ` Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 2/2] SysFSTools: lspci: move mdev and iommugroup check outside of verbose Dominik Csapak
` (6 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:21 UTC (permalink / raw)
To: pve-devel
With kernel 6.8, NVIDIA's vGPU driver has a different API than the
previous 'mediated devices'. Adapt our sysfs code to also recognize this
for the 'mdev' paths and add another 'nvidia' property so we can detect
it.
Also parse the new API when it exists, instead of the mediated devices.
The biggest difference to the existing mdev API for our use is that the
devices don't report all generally available types, only the creatable
ones. So if a user wants to configure a VM, the selection is restricted
by what may currently run on the GPU (depending on the exact settings,
e.g. mixed-mode GPUs where different models can be mixed on a single
GPU; not the default though).
We could overcome this by parsing the general info from the
'nvidia-smi' tool, though I'm currently unsure whether that interface
is stable and intended to be parsed (there is no JSON output or
similar, AFAIK).
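For reference, the parsing code below expects 'creatable_vgpu_types' to
contain a header line starting with 'ID', followed by one 'ID : Name'
pair per line, roughly like this (values are illustrative only, not
taken from real hardware):

    ID    : vGPU Name
    105   : NVIDIA A40-4Q
    106   : NVIDIA A40-8Q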
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
best viewed with '-w'
src/PVE/SysFSTools.pm | 65 ++++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 23 deletions(-)
diff --git a/src/PVE/SysFSTools.pm b/src/PVE/SysFSTools.pm
index 8eb9f2e..25b2f31 100644
--- a/src/PVE/SysFSTools.pm
+++ b/src/PVE/SysFSTools.pm
@@ -112,6 +112,10 @@ sub lspci {
if (-d "$devdir/mdev_supported_types") {
$res->{mdev} = 1;
+ } elsif (-d "$devdir/nvidia") {
+ # nvidia driver for kernel 6.8 or higher
+ $res->{mdev} = 1; # for api compatibility
+ $res->{nvidia} = 1;
}
my $device_hash = $ids->{$vendor}->{devices}->{$device} // {};
@@ -159,31 +163,46 @@ sub get_mdev_types {
my $types = [];
- my $mdev_path = "$pcisysfs/devices/$id/mdev_supported_types";
- if (!-d $mdev_path) {
- return $types;
+ my $dev_path = "$pcisysfs/devices/$id";
+ my $mdev_path = "$dev_path/mdev_supported_types";
+ my $nvidia_path = "$dev_path/nvidia/creatable_vgpu_types";
+ if (-d $mdev_path) {
+ dir_glob_foreach($mdev_path, '[^\.].*', sub {
+ my ($type) = @_;
+
+ my $type_path = "$mdev_path/$type";
+
+ my $available = int(file_read_firstline("$type_path/available_instances"));
+ my $description = PVE::Tools::file_get_contents("$type_path/description");
+
+ my $entry = {
+ type => $type,
+ description => $description,
+ available => $available,
+ };
+
+ my $name = file_read_firstline("$type_path/name");
+ $entry->{name} = $name if defined($name);
+
+ push @$types, $entry;
+ });
+ } elsif (-f $nvidia_path) {
+ my $creatable = PVE::Tools::file_get_contents($nvidia_path);
+ for my $line (split("\n", $creatable)) {
+ next if $line =~ m/^ID/; # header
+ next if $line !~ m/^(.*?)\s*:\s*(.*)$/;
+ my $id = $1;
+ my $name = $2;
+
+ push $types->@*, {
+ type => "nvidia-$id", # backwards compatibility
+ description => "", # TODO, read from xml/nvidia-smi ?
+ available => 1,
+ name => $name,
+ }
+ }
}
- dir_glob_foreach($mdev_path, '[^\.].*', sub {
- my ($type) = @_;
-
- my $type_path = "$mdev_path/$type";
-
- my $available = int(file_read_firstline("$type_path/available_instances"));
- my $description = PVE::Tools::file_get_contents("$type_path/description");
-
- my $entry = {
- type => $type,
- description => $description,
- available => $available,
- };
-
- my $name = file_read_firstline("$type_path/name");
- $entry->{name} = $name if defined($name);
-
- push @$types, $entry;
- });
-
return $types;
}
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] [PATCH common 2/2] SysFSTools: lspci: move mdev and iommugroup check outside of verbose
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia sysfs api as mdev Dominik Csapak
@ 2024-08-06 12:21 ` Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running Dominik Csapak
` (5 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:21 UTC (permalink / raw)
To: pve-devel
These checks should not be expensive (only reading/checking files in
sysfs; the parsed vendor/device names are not required), so we should
always include them.
We need at least the mdev part later, at a point where we're not
interested in the rest of the verbose mode.
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
best viewed with -w
src/PVE/SysFSTools.pm | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/src/PVE/SysFSTools.pm b/src/PVE/SysFSTools.pm
index 25b2f31..0bde6d7 100644
--- a/src/PVE/SysFSTools.pm
+++ b/src/PVE/SysFSTools.pm
@@ -103,21 +103,21 @@ sub lspci {
return;
}
- if ($verbose) {
- $res->{iommugroup} = -1;
- if (-e "$devdir/iommu_group") {
- my ($iommugroup) = (readlink("$devdir/iommu_group") =~ m/\/(\d+)$/);
- $res->{iommugroup} = int($iommugroup);
- }
+ $res->{iommugroup} = -1;
+ if (-e "$devdir/iommu_group") {
+ my ($iommugroup) = (readlink("$devdir/iommu_group") =~ m/\/(\d+)$/);
+ $res->{iommugroup} = int($iommugroup);
+ }
- if (-d "$devdir/mdev_supported_types") {
- $res->{mdev} = 1;
- } elsif (-d "$devdir/nvidia") {
- # nvidia driver for kernel 6.8 or higher
- $res->{mdev} = 1; # for api compatibility
- $res->{nvidia} = 1;
- }
+ if (-d "$devdir/mdev_supported_types") {
+ $res->{mdev} = 1;
+ } elsif (-d "$devdir/nvidia") {
+ # nvidia driver for kernel 6.8 or higher
+ $res->{mdev} = 1; # for api compatibility
+ $res->{nvidia} = 1;
+ }
+ if ($verbose) {
my $device_hash = $ids->{$vendor}->{devices}->{$device} // {};
my $sub_vendor = file_read_firstline("$devdir/subsystem_vendor");
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 1/2] SysFSTools: handle new nvidia sysfs api as mdev Dominik Csapak
2024-08-06 12:21 ` [pve-devel] [PATCH common 2/2] SysFSTools: lspci: move mdev and iommugroup check outside of verbose Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove Dominik Csapak
` (4 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
To: pve-devel
The only way this can happen is when we're being called from
'qm showcmd', and there we don't want to reserve or create anything.
In case the VM was not running, we actually reserve the devices, so we
want to call 'cleanup_pci_devices' afterwards to remove those
reservations again. This minimizes the timespan during which those
devices are unavailable for real VM starts.
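A rough sketch of the resulting flow for a stopped VM (function names
as used in this patch):

    qm showcmd <vmid>
      -> vm_commandline()
           -> config_to_command()     # may reserve PCI IDs
           -> cleanup_pci_devices()   # new: drops those reservations again

For a running VM, choose_hostpci_devices() now skips the reservation
entirely.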
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
Alternatively, we could pass through the information that we're in
'showcmd', but since that's a relatively long call chain and the
involved functions already have a lot of parameters, I opted for this.
I'm not opposed to another solution, if wanted, though.
PVE/QemuServer.pm | 5 +++++
PVE/QemuServer/PCI.pm | 9 +++++++--
test/run_config2command_tests.pl | 5 ++++-
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index b26da505..b2cbe00e 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -6129,6 +6129,11 @@ sub vm_commandline {
my $defaults = load_defaults();
my $cmd = config_to_command($storecfg, $vmid, $conf, $defaults, $forcemachine, $forcecpu);
+ # if the vm is not running, we need to clean up the reserved/created devices
+ if (!PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
+ eval { cleanup_pci_devices($vmid, $conf) };
+ warn $@ if $@;
+ }
return PVE::Tools::cmd2string($cmd);
}
diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index 1673041b..97eb2165 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -523,6 +523,9 @@ sub parse_hostpci_devices {
my sub choose_hostpci_devices {
my ($devices, $vmid) = @_;
+ # if the vm is running, we must be in 'showcmd', so don't actually reserve or create anything
+ my $is_running = PVE::QemuServer::Helpers::vm_running_locally($vmid) ? 1 : 0;
+
my $used = {};
my $add_used_device = sub {
@@ -555,8 +558,10 @@ my sub choose_hostpci_devices {
my $ids = [map { $_->{id} } @$alternative];
next if grep { defined($used->{$_}) } @$ids; # already used
- eval { reserve_pci_usage($ids, $vmid, 10, undef) };
- next if $@;
+ if (!$is_running) {
+ eval { reserve_pci_usage($ids, $vmid, 10, undef) };
+ next if $@;
+ }
# found one that is not used or reserved
$add_used_device->($alternative);
diff --git a/test/run_config2command_tests.pl b/test/run_config2command_tests.pl
index d48ef562..9b5e87ff 100755
--- a/test/run_config2command_tests.pl
+++ b/test/run_config2command_tests.pl
@@ -196,7 +196,10 @@ $qemu_server_module->mock(
},
get_initiator_name => sub {
return 'iqn.1993-08.org.debian:01:aabbccddeeff';
- }
+ },
+ cleanup_pci_devices => {
+ # do nothing
+ },
);
my $qemu_server_config;
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
` (2 preceding siblings ...)
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 1/3] pci: choose devices: don't reserve pciids when vm is already running Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8 Dominik Csapak
` (3 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
To: pve-devel
so we can partially release IDs again. This will be necessary for
NVIDIA's new sysfs API.
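A short usage sketch (the PCI ID is only an example):

    # release all reservations held by $vmid
    remove_pci_reservation($vmid);

    # release only the listed IDs, keep the vmid's other reservations
    remove_pci_reservation($vmid, ['0000:01:00.4']);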
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
PVE/QemuServer/PCI.pm | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index 97eb2165..ae180e08 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -719,11 +719,12 @@ my $write_pci_reservation_unlocked = sub {
# removes all PCI device reservations held by the `vmid`
sub remove_pci_reservation {
- my ($vmid) = @_;
+ my ($vmid, $pciids) = @_;
PVE::Tools::lock_file($PCIID_RESERVATION_LOCK, 2, sub {
my $reservation_list = $parse_pci_reservation_unlocked->();
for my $id (keys %$reservation_list) {
+ next if defined($pciids) && !grep { $_ eq $id } $pciids->@*;
my $reservation = $reservation_list->{$id};
next if $reservation->{vmid} != $vmid;
delete $reservation_list->{$id};
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
` (3 preceding siblings ...)
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 2/3] pci: remove pci reservation: optionally give list of ids to remove Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak
` (2 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
To: pve-devel
Since kernel 6.8, NVIDIA's vGPU driver does not use the generic mdev
interface anymore, as it relied on a feature there which is no longer
available. IIUC, the kernel documentation [0] recommends that drivers
implement their own device-specific features, since putting everything
into the generic interface does not make sense.
They now have an 'nvidia' folder in the device sysfs path, which
contains the files `creatable_vgpu_types`/`current_vgpu_type` to
control the virtual function's vGPU model; the whole virtual function
then has to be passed through (although without resetting it or binding
it to the vfio-pci driver).
This patch implements changes so that, from a config perspective, it
still is a mediated device, and we map to the new functionality iff the
device has no mediated devices but does have NVIDIA's new sysfs API and
the model name is 'nvidia-<..>'.
It behaves a bit differently than mdevs and normal PCI passthrough: we
have to choose the correct device immediately, since the vGPU type is
bound to the PCI ID, but we must not bind the device to vfio-pci, as
the NVIDIA driver implements this functionality itself.
When cleaning up, we iterate over all reserved devices (since for a
mapping we can't know at this point which one was chosen besides
looking at the reservations) and reset the vGPU model to '0', which
frees up the reservation on NVIDIA's side. (We also do that in a loop,
since the device is not always immediately ready after QEMU closes.)
A general problem (which was previously also the case) is that a
showcmd for a non-running guest reserves the PCI IDs, which might block
the start of a different real VM. This is now a bit more problematic,
as we then also (temporarily) set the vGPU type.
0: https://docs.kernel.org/driver-api/vfio-pci-device-specific-driver-acceptance.html
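For reference, the sysfs files this patch relies on look roughly like
this (the device address is illustrative):

    /sys/bus/pci/devices/0000:01:00.4/nvidia/creatable_vgpu_types
        # 'ID : Name' lines of the currently creatable types
    /sys/bus/pci/devices/0000:01:00.4/nvidia/current_vgpu_type
        # write a type ID to allocate the vGPU, '0' to free it again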
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
PVE/QemuServer.pm | 25 ++++++++--
PVE/QemuServer/PCI.pm | 80 ++++++++++++++++++++++++++++++--
test/run_config2command_tests.pl | 3 ++
3 files changed, 100 insertions(+), 8 deletions(-)
diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index b2cbe00e..981e8fa7 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -5833,8 +5833,8 @@ sub vm_start_nolock {
my $chosen_mdev;
for my $dev ($d->{ids}->@*) {
- my $info = eval { PVE::QemuServer::PCI::prepare_pci_device($vmid, $dev->{id}, $index, $d->{mdev}) };
- if ($d->{mdev}) {
+ my $info = eval { PVE::QemuServer::PCI::prepare_pci_device($vmid, $dev->{id}, $index, $d) };
+ if ($d->{mdev} || $d->{nvidia}) {
warn $@ if $@;
$chosen_mdev = $info;
last if $chosen_mdev; # if successful, we're done
@@ -5843,7 +5843,7 @@ sub vm_start_nolock {
}
}
- next if !$d->{mdev};
+ next if !$d->{mdev} && !$d->{nvidia};
die "could not create mediated device\n" if !defined($chosen_mdev);
# nvidia grid needs the uuid of the mdev as qemu parameter
@@ -6175,6 +6175,25 @@ sub cleanup_pci_devices {
# templates don't use pci devices
return if $conf->{template};
+ my $reservations = PVE::QemuServer::PCI::get_reservations($vmid);
+ # clean up nvidia devices
+ for my $id ($reservations->@*) {
+ $id = '0000:'.$id if $id !~ m/^0000:/;
+
+ my $create_path = "/sys/bus/pci/devices/$id/nvidia/current_vgpu_type";
+
+ next if ! -f $create_path;
+
+ for (my $i = 0; $i < 10; $i++) {
+ last if file_read_firstline($create_path) eq "0";
+ sleep 1;
+ PVE::SysFSTools::file_write($create_path, "0");
+ }
+ if (file_read_firstline($create_path) ne "0") {
+ warn "could not cleanup nvidia vgpu for '$id'\n";
+ }
+ }
+
foreach my $key (keys %$conf) {
next if $key !~ m/^hostpci(\d+)$/;
my $hostpciindex = $1;
diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index ae180e08..1ea043bb 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -453,7 +453,12 @@ sub parse_hostpci {
}
if (scalar($ids->@*) > 1) {
$res->{'has-multifunction'} = 1;
- die "cannot use mediated device with multifunction device\n" if $res->{mdev};
+ die "cannot use mediated device with multifunction device\n" if $res->{mdev} || $res->{nvidia};
+ } elsif ($res->{mdev}) {
+ if ($ids->[0]->{nvidia} && $res->{mdev} =~ m/^nvidia-(\d+)$/) {
+ $res->{nvidia} = $1;
+ delete $res->{mdev};
+ }
}
push $res->{ids}->@*, $ids;
}
@@ -497,7 +502,7 @@ sub parse_hostpci_devices {
die "legacy IGD assignment is not compatible with x-vga\n"
if $d->{'x-vga'};
die "legacy IGD assignment is not compatible with mdev\n"
- if $d->{mdev};
+ if $d->{mdev} || $d->{nvidia};
die "legacy IGD assignment is not compatible with q35\n"
if $q35;
die "legacy IGD assignment is not compatible with multifunction devices\n"
@@ -515,6 +520,41 @@ sub parse_hostpci_devices {
return $parsed_devices;
}
+# set vgpu type of a vf of an nvidia gpu with kernel 6.8 or newer
+my sub create_nvidia_device {
+ my ($id, $model) = @_;
+
+ $id = '0000:'.$id if $id !~ m/^0000:/;
+
+ my $creation = "/sys/bus/pci/devices/$id/nvidia/current_vgpu_type";
+
+ die "no nvidia sysfs api for '$id'\n" if ! -f $creation;
+
+ my $current = PVE::Tools::file_read_firstline($creation);
+ if ($current ne "0") {
+ return 1 if $current eq $model;
+ # reset vgpu type so we can see all available and set the real device
+ die "unable to reset vgpu type for '$id'\n" if !PVE::SysFSTools::file_write($creation, "0");
+ }
+
+ my $types = PVE::SysFSTools::get_mdev_types($id);
+ my $selected;
+ for my $type_definition ($types->@*) {
+ next if $type_definition->{type} ne "nvidia-$model";
+ $selected = $type_definition;
+ }
+
+ if (!defined($selected) || $selected->{available} < 1) {
+ die "vgpu type '$model' not available for '$id'\n";
+ }
+
+ if (!PVE::SysFSTools::file_write($creation, $model)) {
+ die "could not set vgpu type to '$model' for '$id'\n";
+ }
+
+ return 1;
+}
+
# takes the hash returned by parse_hostpci_devices and for all non mdev gpus,
# selects one of the given alternatives by trying to reserve it
#
@@ -541,7 +581,7 @@ my sub choose_hostpci_devices {
my $device = $devices->{"hostpci$i"};
next if !$device;
- if ($device->{mdev}) {
+ if ($device->{mdev} && !$device->{nvidia}) {
$device->{ids} = [ map { $_->[0] } $device->{ids}->@* ];
next;
}
@@ -550,6 +590,10 @@ my sub choose_hostpci_devices {
# we only have one alternative, use that
$device->{ids} = $device->{ids}->[0];
$add_used_device->($device->{ids});
+ if ($device->{nvidia} && !$is_running) {
+ reserve_pci_usage($device->{ids}->[0]->{id}, $vmid, 10, undef);
+ create_nvidia_device($device->{ids}->[0]->{id}, $device->{nvidia});
+ }
next;
}
@@ -563,6 +607,15 @@ my sub choose_hostpci_devices {
next if $@;
}
+ if ($device->{nvidia} && !$is_running) {
+ eval { create_nvidia_device($ids->[0], $device->{nvidia}) };
+ if (my $err = $@) {
+ warn $err;
+ remove_pci_reservation($vmid, $ids);
+ next;
+ }
+ }
+
# found one that is not used or reserved
$add_used_device->($alternative);
$device->{ids} = $alternative;
@@ -661,13 +714,15 @@ sub print_hostpci_devices {
}
sub prepare_pci_device {
- my ($vmid, $pciid, $index, $mdev) = @_;
+ my ($vmid, $pciid, $index, $device) = @_;
my $info = PVE::SysFSTools::pci_device_info("$pciid");
die "cannot prepare PCI pass-through, IOMMU not present\n" if !PVE::SysFSTools::check_iommu_support();
die "no pci device info for device '$pciid'\n" if !$info;
- if ($mdev) {
+ if ($device->{nvidia}) {
+ # nothing to do
+ } elsif (my $mdev = $device->{mdev}) {
my $uuid = generate_mdev_uuid($vmid, $index);
PVE::SysFSTools::pci_create_mdev_device($pciid, $uuid, $mdev);
} else {
@@ -734,6 +789,21 @@ sub remove_pci_reservation {
die $@ if $@;
}
+# return all currently reserved ids from the given vmid
+sub get_reservations {
+ my ($vmid) = @_;
+
+ my $reservations = $parse_pci_reservation_unlocked->();
+
+ my $list = [];
+
+ for my $pci_id (sort keys $reservations->%*) {
+ push $list->@*, $pci_id if $reservations->{$pci_id}->{vmid} == $vmid;
+ }
+
+ return $list;
+}
+
sub reserve_pci_usage {
my ($requested_ids, $vmid, $timeout, $pid) = @_;
diff --git a/test/run_config2command_tests.pl b/test/run_config2command_tests.pl
index 9b5e87ff..8c525f09 100755
--- a/test/run_config2command_tests.pl
+++ b/test/run_config2command_tests.pl
@@ -387,6 +387,9 @@ $pci_module->mock(
return undef;
},
+ create_nvidia_device => sub {
+ return 1;
+ }
);
sub diff($$) {
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
` (4 preceding siblings ...)
2024-08-06 12:22 ` [pve-devel] [PATCH qemu-server 3/3] pci: mdev: adapt to nvidia interface with kernel >= 6.8 Dominik Csapak
@ 2024-08-06 12:22 ` Dominik Csapak
2024-10-24 16:53 ` Thomas Lamprecht
2024-10-17 10:16 ` [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Christoph Heiss
2024-10-24 17:06 ` [pve-devel] partially-applied: " Thomas Lamprecht
7 siblings, 1 reply; 12+ messages in thread
From: Dominik Csapak @ 2024-08-06 12:22 UTC (permalink / raw)
To: pve-devel
Currently, when we have a PCI resource mapping, we manually check only
the available models for the first PCI entry. This often works, but not
always, since one could have completely different devices in one
mapping, or, with the new NVIDIA sysfs API, we don't get the generally
available models.
To improve this, extend the 'pciid' regex to accept either PCI IDs or
mapping names, and for mappings, iterate over all local PCI devices in
it and extract the mdev types.
This also vastly simplifies the UI code, since we only have to give the
mapping to the selector instead of an (arbitrarily selected) PCI ID.
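The endpoint then accepts both forms (the mapping name is only an
example):

    GET /nodes/NODE/hardware/pci/0000:01:00.0/mdev   # by PCI ID, as before
    GET /nodes/NODE/hardware/pci/my-gpu-mapping/mdev # by mapping name, new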
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
PVE/API2/Hardware/PCI.pm | 45 +++++++++++++++++++++++++++++-------
www/manager6/qemu/PCIEdit.js | 12 +---------
2 files changed, 38 insertions(+), 19 deletions(-)
diff --git a/PVE/API2/Hardware/PCI.pm b/PVE/API2/Hardware/PCI.pm
index a3a689bf..7135a605 100644
--- a/PVE/API2/Hardware/PCI.pm
+++ b/PVE/API2/Hardware/PCI.pm
@@ -135,7 +135,7 @@ __PACKAGE__->register_method ({
__PACKAGE__->register_method ({
name => 'pciindex',
- path => '{pciid}',
+ path => '{pciid-or-mapping}',
method => 'GET',
description => "Index of available pci methods",
permissions => {
@@ -145,9 +145,9 @@ __PACKAGE__->register_method ({
additionalProperties => 0,
properties => {
node => get_standard_option('pve-node'),
- pciid => {
+ 'pciid-or-mapping' => {
type => 'string',
- pattern => '(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]',
+ pattern => '(?:(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])|([a-zA-Z][a-zA-Z0-9_-]+)',
},
},
},
@@ -171,7 +171,7 @@ __PACKAGE__->register_method ({
__PACKAGE__->register_method ({
name => 'mdevscan',
- path => '{pciid}/mdev',
+ path => '{pciid-or-mapping}/mdev',
method => 'GET',
description => "List mediated device types for given PCI device.",
protected => 1,
@@ -183,10 +183,10 @@ __PACKAGE__->register_method ({
additionalProperties => 0,
properties => {
node => get_standard_option('pve-node'),
- pciid => {
+ 'pciid-or-mapping' => {
type => 'string',
- pattern => '(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]',
- description => "The PCI ID to list the mdev types for."
+ pattern => '(?:(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])|([a-zA-Z][a-zA-Z0-9_-]+)',
+ description => "The PCI ID or mapping to list the mdev types for."
},
},
},
@@ -206,6 +206,12 @@ __PACKAGE__->register_method ({
},
description => {
type => 'string',
+ description => "Additional description of the type."
+ },
+ name => {
+ type => 'string',
+ optional => 1,
+ description => 'A human readable name for the type.',
},
},
},
@@ -213,5 +219,28 @@ __PACKAGE__->register_method ({
code => sub {
my ($param) = @_;
- return PVE::SysFSTools::get_mdev_types($param->{pciid});
+ if ($param->{'pciid-or-mapping'} =~ m/^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]$/) {
+ return PVE::SysFSTools::get_mdev_types($param->{'pciid-or-mapping'});
+ } else {
+ my $mapping = $param->{'pciid-or-mapping'};
+
+ my $types = {};
+ my $devices = PVE::Mapping::PCI::find_on_current_node($mapping);
+ for my $device ($devices->@*) {
+ my $id = $device->{path};
+ next if $id =~ m/;/; # mdev not supported for multifunction devices
+
+ my $device_types = PVE::SysFSTools::get_mdev_types($id);
+
+ for my $type_definition ($device_types->@*) {
+ my $type = $type_definition->{type};
+ if (!defined($types->{$type})) {
+ $types->{$type} = $type_definition;
+ }
+ }
+ }
+
+ return [sort { $a->{type} cmp $b->{type} } values($types->%*)];
+ }
+
}});
diff --git a/www/manager6/qemu/PCIEdit.js b/www/manager6/qemu/PCIEdit.js
index 970e81c2..7c1b7b85 100644
--- a/www/manager6/qemu/PCIEdit.js
+++ b/www/manager6/qemu/PCIEdit.js
@@ -67,17 +67,7 @@ Ext.define('PVE.qemu.PCIInputPanel', {
let path = value;
if (pciDev.data.map) {
- // find local mapping
- for (const entry of pciDev.data.map) {
- let mapping = PVE.Parser.parsePropertyString(entry);
- if (mapping.node === pcisel.up('inputpanel').nodename) {
- path = mapping.path.split(';')[0];
- break;
- }
- }
- if (path.indexOf('.') === -1) {
- path += '.0';
- }
+ path = pciDev.data.id;
}
if (pciDev.data.mdev) {
--
2.39.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings
2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak
@ 2024-10-24 16:53 ` Thomas Lamprecht
2024-10-30 8:42 ` Dominik Csapak
0 siblings, 1 reply; 12+ messages in thread
From: Thomas Lamprecht @ 2024-10-24 16:53 UTC (permalink / raw)
To: Proxmox VE development discussion, Dominik Csapak
On 06/08/2024 at 14:22, Dominik Csapak wrote:
> Currently, when we have a PCI resource mapping, we manually check only
> the available models for the first PCI entry. This often works, but not
> always, since one could have completely different devices in one
> mapping, or, with the new NVIDIA sysfs API, we don't get the generally
> available models.
>
> To improve this, extend the 'pciid' regex to accept either PCI IDs or
> mapping names, and for mappings, iterate over all local PCI devices in
> it and extract the mdev types.
>
> This also vastly simplifies the UI code, since we only have to give the
> mapping to the selector instead of an (arbitrarily selected) PCI ID.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> PVE/API2/Hardware/PCI.pm | 45 +++++++++++++++++++++++++++++-------
> www/manager6/qemu/PCIEdit.js | 12 +---------
> 2 files changed, 38 insertions(+), 19 deletions(-)
>
> diff --git a/PVE/API2/Hardware/PCI.pm b/PVE/API2/Hardware/PCI.pm
> index a3a689bf..7135a605 100644
> --- a/PVE/API2/Hardware/PCI.pm
> +++ b/PVE/API2/Hardware/PCI.pm
> @@ -135,7 +135,7 @@ __PACKAGE__->register_method ({
>
> __PACKAGE__->register_method ({
> name => 'pciindex',
> - path => '{pciid}',
> + path => '{pciid-or-mapping}',
> method => 'GET',
> description => "Index of available pci methods",
> permissions => {
> @@ -145,9 +145,9 @@ __PACKAGE__->register_method ({
> additionalProperties => 0,
> properties => {
> node => get_standard_option('pve-node'),
> - pciid => {
> + 'pciid-or-mapping' => {
While in general this should be fine w.r.t. breaking changes for the API itself,
as the name of those template variables from the API path can basically be seen
as an internal detail, it would IMO be warranted to state that explicitly in the
commit message, as FWICT this is basically the only case where changing the
parameter name mid-release is legitimate without some backward-compat handling.
And just to be sure: are you certain that this cannot be passed any other
way, e.g. in some CLI (pvesh?)?
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings
2024-10-24 16:53 ` Thomas Lamprecht
@ 2024-10-30 8:42 ` Dominik Csapak
2024-10-30 8:46 ` Thomas Lamprecht
0 siblings, 1 reply; 12+ messages in thread
From: Dominik Csapak @ 2024-10-30 8:42 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 10/24/24 18:53, Thomas Lamprecht wrote:
> On 06/08/2024 at 14:22, Dominik Csapak wrote:
>> Currently, when we have a PCI resource mapping, we manually check only
>> the available models for the first PCI entry. This often works, but not
>> always, since one could have completely different devices in one
>> mapping, or, with the new NVIDIA sysfs API, we don't get the generally
>> available models.
>>
>> To improve this, extend the 'pciid' regex to accept either PCI IDs or
>> mapping names, and for mappings, iterate over all local PCI devices in
>> it and extract the mdev types.
>>
>> This also vastly simplifies the UI code, since we only have to give the
>> mapping to the selector instead of an (arbitrarily selected) PCI ID.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> PVE/API2/Hardware/PCI.pm | 45 +++++++++++++++++++++++++++++-------
>> www/manager6/qemu/PCIEdit.js | 12 +---------
>> 2 files changed, 38 insertions(+), 19 deletions(-)
>>
>> diff --git a/PVE/API2/Hardware/PCI.pm b/PVE/API2/Hardware/PCI.pm
>> index a3a689bf..7135a605 100644
>> --- a/PVE/API2/Hardware/PCI.pm
>> +++ b/PVE/API2/Hardware/PCI.pm
>> @@ -135,7 +135,7 @@ __PACKAGE__->register_method ({
>>
>> __PACKAGE__->register_method ({
>> name => 'pciindex',
>> - path => '{pciid}',
>> + path => '{pciid-or-mapping}',
>> method => 'GET',
>> description => "Index of available pci methods",
>> permissions => {
>> @@ -145,9 +145,9 @@ __PACKAGE__->register_method ({
>> additionalProperties => 0,
>> properties => {
>> node => get_standard_option('pve-node'),
>> - pciid => {
>> + 'pciid-or-mapping' => {
>
> While in general this should be fine w.r.t. breaking changes for the API itself,
> as the name of those template variables from the API path can basically be seen
> as an internal detail, it would IMO be warranted to state that explicitly in the
> commit message, as FWICT this is basically the only case where changing the
> parameter name mid-release is legitimate without some backward-compat handling.
Yeah, you're right, I'll send a v2 of this with the commit message
extended, but see below for an alternative.
>
> And just to be sure: are you certain that this cannot be passed any other
> way, e.g. in some CLI (pvesh?)?
I did not find a way, and AFAICT this parameter is always part of the
path, where we can't give explicit names.
Of course, we can leave the name of the parameter the same if you'd
prefer that (with a TODO notice for 9.x?) and just change the regex;
then it's definitely backwards compatible, even if I overlooked some
case where one can give the parameter name.
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings
2024-10-30 8:42 ` Dominik Csapak
@ 2024-10-30 8:46 ` Thomas Lamprecht
0 siblings, 0 replies; 12+ messages in thread
From: Thomas Lamprecht @ 2024-10-30 8:46 UTC (permalink / raw)
To: Dominik Csapak, Proxmox VE development discussion
On 30/10/2024 at 09:42, Dominik Csapak wrote:
> I did not find a way, and AFAICT this parameter is always part of the
> path, where we can't give explicit names.
Thanks for checking.
> Of course, we can leave the name of the parameter the same if you'd
> prefer that (with a TODO notice for 9.x?) and just change the regex;
> then it's definitely backwards compatible, even if I overlooked some
> case where one can give the parameter name.
No, it should be fine; let's go with the change. If it's stated
explicitly why this is OK, other (future) devs finding this will have
the context to know why it was (deemed to be) fine here.
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
` (5 preceding siblings ...)
2024-08-06 12:22 ` [pve-devel] [PATCH manager 1/1] api/ui: improve mdev listing for pci mappings Dominik Csapak
@ 2024-10-17 10:16 ` Christoph Heiss
2024-10-24 17:06 ` [pve-devel] partially-applied: " Thomas Lamprecht
7 siblings, 0 replies; 12+ messages in thread
From: Christoph Heiss @ 2024-10-17 10:16 UTC (permalink / raw)
To: Dominik Csapak; +Cc: Proxmox VE development discussion
Tested this entire series (+ the one prerequisite patch) using an RTX
A5000. Everything applied cleanly on latest master of each respective
repo. Tested on (latest) kernel 6.8.12-2-pve for reference.
vGPU datacenter resource mapping and VM PCI device setup worked fine as
it should, once I had the drivers & co. properly set up.
Verified the available vGPU PCI devices on the host as well as in the
VM, as well as using `nvidia-smi`. Further, I tested running some CUDA
workloads in the VM to ensure everything is properly set up.
The only "regression" I noticed is that the NVIDIA devices now have an
empty description in the mdev device list, but that's not critical and
can be improved in the future - as also duly noted in the code.
Also looked through the code, looks good IMO. Having to special-case
it for "normal" mdevs and NVIDIA is rather unfortunate, but - I also
talked off-list a bit with Dominik about it - implementing it as a
plugin system would be way more work than justifiable in this case.
So please consider the entire series:
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Reviewed-by: Christoph Heiss <c.heiss@proxmox.com>
On Tue, Aug 06, 2024 at 02:21:57PM GMT, Dominik Csapak wrote:
> For many new cards, NVIDIA changed the kernel interface since kernel
> version 6.8. Instead of using mediated devices, they provide their own
> API.
>
> This series adapts to that, with no required change to the VM config,
> and only minimal changes to our API.
>
> The biggest change is that the mdev types can now be queried on
> /nodes/NODE/hardware/pci/<pciid-or-mapping>/mdev either via a PCI ID
> (like it was before) or via the name of a PCI mapping (now checks all
> local PCI devices from that mapping)
>
> A thing to improve could be to parse the available vGPU types from
> nvidia-smi instead of the sysfs, since that does not always contain
> all types (see the common patch 1/2 for details)
>
> We could probably abstract the code that deals with the different
> types a bit more, but it seems OK to me for now, and finding a good
> API for that is hard with only 3 modes that are very different from
> each other (raw/mdev/nvidia).
>
> The qemu-server patches depend on the common patches, but the manager
> patch does not rely on any other in this series. It is required,
> though, for the user to be able to select types (in certain conditions).
>
> Note that this series requires my previous patch to SysFSTools to
> improve write reliability [0], otherwise the cleanup or creation may
> fail.
>
> 0: https://lists.proxmox.com/pipermail/pve-devel/2024-July/064814.html
>
> pve-common:
>
> Dominik Csapak (2):
> SysFSTools: handle new nvidia sysfs api as mdev
> SysFSTools: lspci: move mdev and iommugroup check outside of verbose
>
> src/PVE/SysFSTools.pm | 83 ++++++++++++++++++++++++++-----------------
> 1 file changed, 51 insertions(+), 32 deletions(-)
>
> qemu-server:
>
> Dominik Csapak (3):
> pci: choose devices: don't reserve pciids when vm is already running
> pci: remove pci reservation: optionally give list of ids to remove
> pci: mdev: adapt to nvidia interface with kernel >= 6.8
>
> PVE/QemuServer.pm | 30 +++++++++--
> PVE/QemuServer/PCI.pm | 92 +++++++++++++++++++++++++++++---
> test/run_config2command_tests.pl | 8 ++-
> 3 files changed, 118 insertions(+), 12 deletions(-)
>
> pve-manager:
>
> Dominik Csapak (1):
> api/ui: improve mdev listing for pci mappings
>
> PVE/API2/Hardware/PCI.pm | 45 +++++++++++++++++++++++++++++-------
> www/manager6/qemu/PCIEdit.js | 12 +---------
> 2 files changed, 38 insertions(+), 19 deletions(-)
>
> --
> 2.39.2
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [pve-devel] partially-applied: [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes
2024-08-06 12:21 [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Dominik Csapak
` (6 preceding siblings ...)
2024-10-17 10:16 ` [pve-devel] [PATCH common/qemu-server/manager] adapt to nvidia vgpu api changes Christoph Heiss
@ 2024-10-24 17:06 ` Thomas Lamprecht
7 siblings, 0 replies; 12+ messages in thread
From: Thomas Lamprecht @ 2024-10-24 17:06 UTC (permalink / raw)
To: Proxmox VE development discussion, Dominik Csapak
On 06/08/2024 at 14:21, Dominik Csapak wrote:
> For many new cards, NVIDIA changed the kernel interface since kernel
> version 6.8. Instead of using mediated devices, they provide their own
> API.
>
> This series adapts to that, with no required change to the VM config,
> and only minimal changes to our API.
>
> The biggest change is that the mdev types can now be queried on
> /nodes/NODE/hardware/pci/<pciid-or-mapping>/mdev either via a PCI ID
> (like it was before) or via the name of a PCI mapping (now checks all
> local PCI devices from that mapping)
>
> A thing to improve could be to parse the available vGPU types from
> nvidia-smi instead of the sysfs, since that does not always contain
> all types (see the common patch 1/2 for details)
>
> We could probably abstract the code that deals with the different
> types a bit more, but it seems OK to me for now, and finding a good
> API for that is hard with only 3 modes that are very different from
> each other (raw/mdev/nvidia).
>
> The qemu-server patches depend on the common patches, but the manager
> patch does not rely on any other in this series. It is required,
> though, for the user to be able to select types (in certain conditions).
>
> Note that this series requires my previous patch to SysFSTools to
> improve write reliability [0], otherwise the cleanup or creation may
> fail.
>
> 0: https://lists.proxmox.com/pipermail/pve-devel/2024-July/064814.html
>
> pve-common:
>
> Dominik Csapak (2):
> SysFSTools: handle new nvidia sysfs api as mdev
> SysFSTools: lspci: move mdev and iommugroup check outside of verbose
>
> src/PVE/SysFSTools.pm | 83 ++++++++++++++++++++++++++-----------------
> 1 file changed, 51 insertions(+), 32 deletions(-)
>
> qemu-server:
>
> Dominik Csapak (3):
> pci: choose devices: don't reserve pciids when vm is already running
> pci: remove pci reservation: optionally give list of ids to remove
> pci: mdev: adapt to nvidia interface with kernel >= 6.8
>
> PVE/QemuServer.pm | 30 +++++++++--
> PVE/QemuServer/PCI.pm | 92 +++++++++++++++++++++++++++++---
> test/run_config2command_tests.pl | 8 ++-
> 3 files changed, 118 insertions(+), 12 deletions(-)
>
> pve-manager:
>
> Dominik Csapak (1):
> api/ui: improve mdev listing for pci mappings
>
> PVE/API2/Hardware/PCI.pm | 45 +++++++++++++++++++++++++++++-------
> www/manager6/qemu/PCIEdit.js | 12 +---------
> 2 files changed, 38 insertions(+), 19 deletions(-)
>
applied all but the pve-manager patch for now with Christoph's T-b and R-b, thanks!
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 12+ messages in thread