[pve-devel] [RFC qemu-server v2 3/4] fix #6378 (continued): warn intel-iommu users about iommu and host aw bits mismatch

From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC qemu-server v2 3/4] fix #6378 (continued): warn intel-iommu users about iommu and host aw bits mismatch
Date: Tue,  2 Sep 2025 13:22:00 +0200
Message-ID: <20250902112307.124706-5-d.kral@proxmox.com>
In-Reply-To: <20250902112307.124706-1-d.kral@proxmox.com>

For certain host CPUs, such as Intel consumer-grade CPUs, there is a
frequent mismatch between the CPU's physical address width and the
IOMMU's address width.

If a virtual machine is set up with an intel-iommu device, QEMU allocates
and maps the (virtual) I/O address space (IOAS) for a VFIO passthrough
device with iommufd.

If the address widths of the host CPU and the IOMMU do not match, the
guest physical address space (GPAS) and memory-type range registers
(MTRRs) are still set up according to the host CPU's address width. This
causes the IOAS to be allocated and mapped outside of the IOMMU's maximum
guest address width (MGAW), which makes QEMU fail with the following
error (the error message is copied from the user forum [0]):

    kvm: vfio_container_dma_map(0x5c9222494280, 0x380000000000, 0x10000, 0x78075ee70000) = -22 (Invalid argument)

This error is rather confusing and unhelpful to users, so warn them
about a CPU physical address width that exceeds the IOMMU address width.
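
To illustrate the mismatch with the mapping from the error above, here
is a minimal Perl sketch. The 39-bit IOMMU address width is an assumption
for illustration only (it matches the aw-bits value used in this patch's
test case, not necessarily the forum user's host), and bits_needed() is a
hypothetical helper that is not part of qemu-server:

```perl
use strict;
use warnings;

# Hypothetical helper: number of bits needed to represent an address,
# i.e. the position of the highest set bit plus one.
sub bits_needed {
    my ($addr) = @_;
    my $bits = 0;
    while ($addr > 0) {
        $bits++;
        $addr >>= 1;
    }
    return $bits;
}

# IOVA and size taken from the vfio_container_dma_map() error message.
my $iova = 0x380000000000;
my $size = 0x10000;

# Assumption for illustration: the IOMMU supports a 39-bit address width.
my $iommu_aw_bits = 39;

my $last_addr = $iova + $size - 1;
my $needed = bits_needed($last_addr);

printf "mapping ends at 0x%x and needs %d address bits\n", $last_addr, $needed;
print "mapping exceeds the assumed IOMMU address width of $iommu_aw_bits bits\n"
    if $needed > $iommu_aw_bits;
```

With these values, the end of the mapping needs 46 address bits, which
is more than the assumed 39-bit IOMMU address width and is consistent
with the mapping being rejected.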

[0] https://forum.proxmox.com/threads/vm-wont-start-with-pci-passthrough-after-upgrade-to-9-0.169586/page-3#post-795717

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I already talked about this with @Fiona off-list: this adds quite a lot
of code to qemu-server for just a warning, but the warning is more
readable than the above error, which is only issued once the VM is
already running.

In particular, I don't like the logic duplication in
get_cpu_address_width(...), which tries to copy what
target/i386/{,host-,kvm/kvm-}cpu.c do to retrieve the {,guest_}phys_bits
value; I'd rather see this implemented in pve-qemu as in [0].

There are two QEMU and edk2 discussion threads that might help in
deciding how to proceed with this patch [0] [1]. It might also be better
to implement this downstream in pve-qemu for now, similar to [0], or of
course to contribute an actual fix upstream.

[0] https://lore.kernel.org/qemu-devel/20250130115800.60b7cbe6.alex.williamson@redhat.com/
[1] https://edk2.groups.io/g/devel/topic/patch_v1/102359124
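
For review purposes, the precedence that get_cpu_address_width() in this
patch applies can be restated as a standalone sketch. effective_aw_bits()
is a hypothetical name, and it takes already-parsed values instead of a
CPU property string; the fallback of 40 is the TCG_PHYS_ADDR_BITS value
used in the patch:

```perl
use strict;
use warnings;

# Hypothetical standalone restatement of the precedence used in the patch;
# not part of qemu-server. All arguments may be undef.
sub effective_aw_bits {
    my ($guest_phys_bits, $phys_bits, $host_phys_bits) = @_;

    # An explicit guest-phys-bits wins, but is capped by an explicit phys-bits.
    if ($guest_phys_bits) {
        return $phys_bits if $phys_bits && $guest_phys_bits > $phys_bits;
        return $guest_phys_bits;
    }

    # Otherwise use an explicit phys-bits, then the host's value.
    return $phys_bits if $phys_bits;
    return $host_phys_bits if $host_phys_bits;

    # Last resort: the TCG default mentioned in the patch.
    return 40;
}

# Example inputs (the host value of 39 is made up for illustration).
printf "%d\n", effective_aw_bits(46, undef, 39);       # guest-phys-bits only
printf "%d\n", effective_aw_bits(46, 42, 39);          # capped by phys-bits
printf "%d\n", effective_aw_bits(undef, undef, 39);    # falls back to host
printf "%d\n", effective_aw_bits(undef, undef, undef); # falls back to 40
```

Note that with phys-bits=host, the patch emits host-phys-bits=true
instead of a numeric phys-bits, so that case corresponds to the third
call, where only the host value is available.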

 src/PVE/QemuServer.pm                         |  7 ++-
 src/PVE/QemuServer/CPUConfig.pm               | 46 +++++++++++++++++--
 src/PVE/QemuServer/Machine.pm                 | 13 +++++-
 .../q35-viommu-intel-exceeding-aw-bits.conf   |  4 ++
 ...35-viommu-intel-exceeding-aw-bits.conf.cmd | 25 ++++++++++
 5 files changed, 88 insertions(+), 7 deletions(-)
 create mode 100644 src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
 create mode 100644 src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 04e988c7..6d31bf40 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -61,7 +61,7 @@ use PVE::QemuServer::Helpers
 use PVE::QemuServer::Cloudinit;
 use PVE::QemuServer::CGroup;
 use PVE::QemuServer::CPUConfig
-    qw(print_cpu_device get_cpu_options get_cpu_bitness is_native_arch get_amd_sev_object get_amd_sev_type);
+    qw(print_cpu_device get_cpu_options get_cpu_bitness get_cpu_address_width is_native_arch get_amd_sev_object get_amd_sev_type);
 use PVE::QemuServer::Drive qw(
     is_valid_drivename
     checked_volume_format
@@ -3901,6 +3901,11 @@ sub config_to_command {
     push @$machineFlags, "type=${machine_type_min}";
 
     PVE::QemuServer::Machine::assert_valid_machine_property($machine_conf);
+    PVE::QemuServer::Machine::check_valid_iommu_address_width(
+        $machine_conf,
+        $machine_version,
+        get_cpu_address_width($conf->{cpu}, $arch, $cpuinfo->{phys_bits}),
+    );
 
     if (my $viommu = $machine_conf->{viommu}) {
         my $viommu_devstr = '';
diff --git a/src/PVE/QemuServer/CPUConfig.pm b/src/PVE/QemuServer/CPUConfig.pm
index f57275dd..4671ead9 100644
--- a/src/PVE/QemuServer/CPUConfig.pm
+++ b/src/PVE/QemuServer/CPUConfig.pm
@@ -16,6 +16,7 @@ our @EXPORT_OK = qw(
     print_cpu_device
     get_cpu_options
     get_cpu_bitness
+    get_cpu_address_width
     is_native_arch
     get_amd_sev_object
     get_amd_sev_type
@@ -681,8 +682,21 @@ sub get_cpu_options {
         $pve_forced_flags,
     );
 
+    my $phys_bits_options = get_cpu_phys_bits_options($cpu, $custom_cpu);
+    for my $key (sort keys %$phys_bits_options) {
+        $cpu_str .= ",$key=$phys_bits_options->{$key}";
+    }
+
+    return ('-cpu', $cpu_str);
+}
+
+sub get_cpu_phys_bits_options {
+    my ($cpu, $custom_cpu) = @_;
+
+    my $phys_bits_options = {};
+
     for my $phys_bits_opt (qw(guest-phys-bits phys-bits)) {
-        my $phys_bits = '';
+        my ($key, $value) = ($phys_bits_opt, undef);
         foreach my $conf ($custom_cpu, $cpu) {
             next if !defined($conf);
             my $conf_val = $conf->{$phys_bits_opt};
@@ -690,15 +704,15 @@ sub get_cpu_options {
             if ($conf_val eq 'host') {
                 die "unexpected value 'host' for guest-phys-bits"
                     if $phys_bits_opt eq 'guest-phys-bits';
-                $phys_bits = ",host-phys-bits=true";
+                ($key, $value) = ('host-phys-bits', 'true');
             } else {
-                $phys_bits = ",${phys_bits_opt}=${conf_val}";
+                $value = $conf_val;
             }
         }
-        $cpu_str .= $phys_bits;
+        $phys_bits_options->{$key} = $value if $value;
     }
 
-    return ('-cpu', $cpu_str);
+    return $phys_bits_options;
 }
 
 # Some hardcoded flags required by certain configurations
@@ -844,6 +858,28 @@ sub get_cpu_bitness {
     die "unsupported architecture '$arch'\n";
 }
 
+sub get_cpu_address_width {
+    my ($cpu_prop_str, $arch, $host_phys_bits) = @_;
+
+    $arch //= get_host_arch();
+
+    my ($cputype, $cpu, $custom_cpu) = get_cpu_properties($cpu_prop_str, $arch);
+    my $phys_bits_options = get_cpu_phys_bits_options($cpu, $custom_cpu);
+    my ($phys_bits, $guest_phys_bits) = $phys_bits_options->@{qw(phys-bits guest-phys-bits)};
+
+    my $cpu_aw_bits = 0;
+    $cpu_aw_bits = $guest_phys_bits if $guest_phys_bits;
+    $cpu_aw_bits = $phys_bits if $phys_bits && $cpu_aw_bits > $phys_bits;
+    $cpu_aw_bits = $phys_bits if $phys_bits && !$cpu_aw_bits;
+    $cpu_aw_bits = $host_phys_bits if $host_phys_bits && !$cpu_aw_bits;
+    $cpu_aw_bits = 40 if !$cpu_aw_bits; # fallback to TCG_PHYS_ADDR_BITS
+
+    return int($cpu_aw_bits) if $arch eq 'x86_64';
+    return undef if $arch eq 'aarch64';
+
+    die "unsupported architecture '$arch'\n";
+}
+
 sub get_hw_capabilities {
     # Get reduced-phys-bits & cbitpos from host-hw-capabilities.json
     # TODO: Find better location than /run/qemu-server/
diff --git a/src/PVE/QemuServer/Machine.pm b/src/PVE/QemuServer/Machine.pm
index 57d583c2..c083a27b 100644
--- a/src/PVE/QemuServer/Machine.pm
+++ b/src/PVE/QemuServer/Machine.pm
@@ -3,7 +3,7 @@ package PVE::QemuServer::Machine;
 use strict;
 use warnings;
 
-use PVE::QemuServer::Helpers;
+use PVE::QemuServer::Helpers qw(min_version);
 use PVE::QemuServer::MetaInfo;
 use PVE::QemuServer::Monitor;
 use PVE::JSONSchema qw(get_standard_option parse_property_string print_property_string);
@@ -133,6 +133,17 @@ sub assert_valid_machine_property {
     }
 }
 
+sub check_valid_iommu_address_width {
+    my ($machine_conf, $machine_version, $cpu_aw_bits) = @_;
+    if ($machine_conf->{viommu} && $machine_conf->{viommu} eq 'intel') {
+        my $iommu_aw_bits_default = min_version($machine_version, 9, 2) ? 48 : 39;
+        my $iommu_aw_bits = $machine_conf->{'aw-bits'} // $iommu_aw_bits_default;
+
+        warn "guest address width exceeds vIOMMU address width: $cpu_aw_bits > $iommu_aw_bits\n"
+            if $cpu_aw_bits && $iommu_aw_bits && $cpu_aw_bits > $iommu_aw_bits;
+    }
+}
+
 sub machine_type_is_q35 {
     my ($conf) = @_;
 
diff --git a/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
new file mode 100644
index 00000000..d6cff715
--- /dev/null
+++ b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
@@ -0,0 +1,4 @@
+# TEST: Check if exceeding guest-phys-bits > iommu aw-bits is correctly warned about
+# EXPECT_WARN: guest address width exceeds vIOMMU address width: 46 > 39
+cpu: host,guest-phys-bits=46
+machine: q35,viommu=intel,aw-bits=39
diff --git a/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd
new file mode 100644
index 00000000..0ec488ae
--- /dev/null
+++ b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd
@@ -0,0 +1,25 @@
+/usr/bin/kvm \
+  -id 8006 \
+  -name 'vm8006,debug-threads=on' \
+  -no-shutdown \
+  -chardev 'socket,id=qmp,path=/var/run/qemu-server/8006.qmp,server=on,wait=off' \
+  -mon 'chardev=qmp,mode=control' \
+  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect-ms=5000' \
+  -mon 'chardev=qmp-event,mode=control' \
+  -pidfile /var/run/qemu-server/8006.pid \
+  -daemonize \
+  -smp '1,sockets=1,cores=1,maxcpus=1' \
+  -nodefaults \
+  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
+  -vnc 'unix:/var/run/qemu-server/8006.vnc,password=on' \
+  -cpu 'host,+kvm_pv_eoi,+kvm_pv_unhalt,guest-phys-bits=46' \
+  -m 512 \
+  -global 'ICH9-LPC.disable_s3=1' \
+  -global 'ICH9-LPC.disable_s4=1' \
+  -device 'intel-iommu,intremap=on,caching-mode=on,aw-bits=39' \
+  -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
+  -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
+  -device 'VGA,id=vga,bus=pcie.0,addr=0x1' \
+  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
+  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:aabbccddeeff' \
+  -machine 'type=q35+pve0,kernel-irqchip=split'
-- 
2.47.2


