From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC qemu-server v2 3/4] fix #6378 (continued): warn intel-iommu users about iommu and host aw bits mismatch
Date: Tue, 2 Sep 2025 13:22:00 +0200
Message-ID: <20250902112307.124706-5-d.kral@proxmox.com>
In-Reply-To: <20250902112307.124706-1-d.kral@proxmox.com>
For certain host CPUs, such as Intel consumer-grade CPUs, there is a
frequent mismatch between the CPU's physical address width and the
IOMMU's address width.
If a virtual machine is set up with an intel-iommu device, qemu allocates
and maps the (virtual) I/O address space (IOAS) for a VFIO passthrough
device with iommufd.
If the address widths of the host CPU and the IOMMU do not match, the
guest physical address space (GPAS) and the memory-type range registers
(MTRRs) are set up according to the host CPU's address width. This causes
the IOAS to be allocated and mapped outside of the IOMMU's maximum guest
address width (MGAW), which results in the following error from qemu (the
error message is copied from the user forum [0]):
kvm: vfio_container_dma_map(0x5c9222494280, 0x380000000000, 0x10000, 0x78075ee70000) = -22 (Invalid argument)
This error is rather confusing and unhelpful to users, so warn them
about a CPU physical address width that exceeds the IOMMU address width.
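As a worked check on the error above: the mapping at IOVA 0x380000000000 with size 0x10000 ends at 0x38000000ffff, whose highest set bit is bit 45. It therefore needs 46 address bits and lies outside any IOMMU address width narrower than that, such as 39 bits. The following minimal Perl sketch shows that arithmetic. It assumes a perl built with 64-bit integers, and addr_bits_needed is a hypothetical helper, not part of qemu-server:

```perl
use strict;
use warnings;

# Hypothetical helper: number of address bits needed to reach the last
# byte of the range [$iova, $iova + $size). Assumes 64-bit integers.
sub addr_bits_needed {
    my ($iova, $size) = @_;
    my $last = $iova + $size - 1;
    my $bits = 0;
    while ($last > 0) {
        $bits++;
        $last >>= 1;
    }
    return $bits;
}

# IOVA and size taken from the vfio_container_dma_map() error above.
my ($iova, $size) = (0x380000000000, 0x10000);
my $needed = addr_bits_needed($iova, $size);

for my $aw_bits (39, 48) {
    printf "aw-bits=%d: %s\n", $aw_bits,
        $needed > $aw_bits ? "outside ($needed bits needed)" : "fits";
}
```

Run as-is, it reports that the mapping falls outside a 39-bit address width but fits a 48-bit one.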
[0] https://forum.proxmox.com/threads/vm-wont-start-with-pci-passthrough-after-upgrade-to-9-0.169586/page-3#post-795717
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I already talked about this with @Fiona off-list: this adds quite a lot
of code to qemu-server just for a warning, but the warning is more
readable than the above error, which is only issued once the VM is
started.
In particular, I don't like that get_cpu_address_width(...) duplicates
logic: it tries to copy what target/i386/{,host-,kvm/kvm-}cpu.c do to
retrieve the {,guest_}phys_bits value, whereas I'd rather see this
implemented in pve-qemu as in [0].
There are two discussion threads, on qemu-devel and edk2-devel, that
might help in deciding how to proceed with this patch [0] [1]. It could
also be better to implement this downstream in pve-qemu for now, similar
to [0], or, of course, to contribute an actual fix upstream.
[0] https://lore.kernel.org/qemu-devel/20250130115800.60b7cbe6.alex.williamson@redhat.com/
[1] https://edk2.groups.io/g/devel/topic/patch_v1/102359124
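To make the duplicated precedence easier to review, here is a small standalone Perl restatement of what get_cpu_address_width(...) resolves and how check_valid_iommu_address_width() picks the default vIOMMU width. It is illustrative only. The names effective_cpu_aw_bits and iommu_aw_bits are hypothetical and not part of qemu-server, and machine version handling is reduced to a plain major/minor pair:

```perl
use strict;
use warnings;

# Hypothetical restatement of the precedence in get_cpu_address_width():
# guest-phys-bits, clamped to phys-bits if that is smaller; otherwise
# phys-bits; otherwise the host's phys bits; otherwise 40.
sub effective_cpu_aw_bits {
    my (%opt) = @_;    # keys: guest_phys_bits, phys_bits, host_phys_bits
    my ($guest, $phys, $host) = @opt{qw(guest_phys_bits phys_bits host_phys_bits)};

    my $aw = $guest // 0;
    $aw = $phys if $phys && (!$aw || $aw > $phys);    # clamp to, or fall back on, phys-bits
    $aw ||= $host // 0;                               # then the host CPU's phys bits
    $aw ||= 40;                                       # TCG_PHYS_ADDR_BITS fallback
    return $aw;
}

# Hypothetical: explicit aw-bits wins; otherwise 48 for machine
# version >= 9.2 and 39 for older ones, as in the patch.
sub iommu_aw_bits {
    my ($aw_bits_opt, $ver_major, $ver_minor) = @_;
    my $is_9_2_or_newer = $ver_major > 9 || ($ver_major == 9 && $ver_minor >= 2);
    return $aw_bits_opt // ($is_9_2_or_newer ? 48 : 39);
}

my $cpu_aw = effective_cpu_aw_bits(guest_phys_bits => 46);
my $iommu_aw = iommu_aw_bits(39, 10, 0);
warn "guest address width exceeds vIOMMU address width: $cpu_aw > $iommu_aw\n"
    if $cpu_aw > $iommu_aw;
```

With the values from the new cfg2cmd test (guest-phys-bits=46, aw-bits=39), the final statement emits the same warning that the test expects.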
src/PVE/QemuServer.pm | 7 ++-
src/PVE/QemuServer/CPUConfig.pm | 46 +++++++++++++++++--
src/PVE/QemuServer/Machine.pm | 13 +++++-
.../q35-viommu-intel-exceeding-aw-bits.conf | 4 ++
...35-viommu-intel-exceeding-aw-bits.conf.cmd | 25 ++++++++++
5 files changed, 88 insertions(+), 7 deletions(-)
create mode 100644 src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
create mode 100644 src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 04e988c7..6d31bf40 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -61,7 +61,7 @@ use PVE::QemuServer::Helpers
use PVE::QemuServer::Cloudinit;
use PVE::QemuServer::CGroup;
use PVE::QemuServer::CPUConfig
- qw(print_cpu_device get_cpu_options get_cpu_bitness is_native_arch get_amd_sev_object get_amd_sev_type);
+ qw(print_cpu_device get_cpu_options get_cpu_bitness get_cpu_address_width is_native_arch get_amd_sev_object get_amd_sev_type);
use PVE::QemuServer::Drive qw(
is_valid_drivename
checked_volume_format
@@ -3901,6 +3901,11 @@ sub config_to_command {
push @$machineFlags, "type=${machine_type_min}";
PVE::QemuServer::Machine::assert_valid_machine_property($machine_conf);
+ PVE::QemuServer::Machine::check_valid_iommu_address_width(
+ $machine_conf,
+ $machine_version,
+ get_cpu_address_width($conf->{cpu}, $arch, $cpuinfo->{phys_bits}),
+ );
if (my $viommu = $machine_conf->{viommu}) {
my $viommu_devstr = '';
diff --git a/src/PVE/QemuServer/CPUConfig.pm b/src/PVE/QemuServer/CPUConfig.pm
index f57275dd..4671ead9 100644
--- a/src/PVE/QemuServer/CPUConfig.pm
+++ b/src/PVE/QemuServer/CPUConfig.pm
@@ -16,6 +16,7 @@ our @EXPORT_OK = qw(
print_cpu_device
get_cpu_options
get_cpu_bitness
+ get_cpu_address_width
is_native_arch
get_amd_sev_object
get_amd_sev_type
@@ -681,8 +682,21 @@ sub get_cpu_options {
$pve_forced_flags,
);
+ my $phys_bits_options = get_cpu_phys_bits_options($cpu, $custom_cpu);
+ for my $key (sort keys %$phys_bits_options) {
+ $cpu_str .= ",$key=$phys_bits_options->{$key}";
+ }
+
+ return ('-cpu', $cpu_str);
+}
+
+sub get_cpu_phys_bits_options {
+ my ($cpu, $custom_cpu) = @_;
+
+ my $phys_bits_options = {};
+
for my $phys_bits_opt (qw(guest-phys-bits phys-bits)) {
- my $phys_bits = '';
+ my ($key, $value) = ($phys_bits_opt, undef);
foreach my $conf ($custom_cpu, $cpu) {
next if !defined($conf);
my $conf_val = $conf->{$phys_bits_opt};
@@ -690,15 +704,15 @@ sub get_cpu_options {
if ($conf_val eq 'host') {
die "unexpected value 'host' for guest-phys-bits"
if $phys_bits_opt eq 'guest-phys-bits';
- $phys_bits = ",host-phys-bits=true";
+ ($key, $value) = ('host-phys-bits', 'true');
} else {
- $phys_bits = ",${phys_bits_opt}=${conf_val}";
+ $value = $conf_val;
}
}
- $cpu_str .= $phys_bits;
+ $phys_bits_options->{$key} = $value if $value;
}
- return ('-cpu', $cpu_str);
+ return $phys_bits_options;
}
# Some hardcoded flags required by certain configurations
@@ -844,6 +858,28 @@ sub get_cpu_bitness {
die "unsupported architecture '$arch'\n";
}
+sub get_cpu_address_width {
+ my ($cpu_prop_str, $arch, $host_phys_bits) = @_;
+
+ $arch //= get_host_arch();
+
+ my ($cputype, $cpu, $custom_cpu) = get_cpu_properties($cpu_prop_str, $arch);
+ my $phys_bits_options = get_cpu_phys_bits_options($cpu, $custom_cpu);
+ my ($phys_bits, $guest_phys_bits) = $phys_bits_options->@{qw(phys-bits guest-phys-bits)};
+
+ my $cpu_aw_bits = 0;
+ $cpu_aw_bits = $guest_phys_bits if $guest_phys_bits;
+ $cpu_aw_bits = $phys_bits if $phys_bits && $cpu_aw_bits > $phys_bits;
+ $cpu_aw_bits = $phys_bits if $phys_bits && !$cpu_aw_bits;
+ $cpu_aw_bits = $host_phys_bits if $host_phys_bits && !$cpu_aw_bits;
+ $cpu_aw_bits = 40 if !$cpu_aw_bits; # fallback to TCG_PHYS_ADDR_BITS
+
+ return int($cpu_aw_bits) if $arch eq 'x86_64';
+ return undef if $arch eq 'aarch64';
+
+ die "unsupported architecture '$arch'\n";
+}
+
sub get_hw_capabilities {
# Get reduced-phys-bits & cbitpos from host-hw-capabilities.json
# TODO: Find better location than /run/qemu-server/
diff --git a/src/PVE/QemuServer/Machine.pm b/src/PVE/QemuServer/Machine.pm
index 57d583c2..c083a27b 100644
--- a/src/PVE/QemuServer/Machine.pm
+++ b/src/PVE/QemuServer/Machine.pm
@@ -3,7 +3,7 @@ package PVE::QemuServer::Machine;
use strict;
use warnings;
-use PVE::QemuServer::Helpers;
+use PVE::QemuServer::Helpers qw(min_version);
use PVE::QemuServer::MetaInfo;
use PVE::QemuServer::Monitor;
use PVE::JSONSchema qw(get_standard_option parse_property_string print_property_string);
@@ -133,6 +133,17 @@ sub assert_valid_machine_property {
}
}
+sub check_valid_iommu_address_width {
+ my ($machine_conf, $machine_version, $cpu_aw_bits) = @_;
+ if ($machine_conf->{viommu} && $machine_conf->{viommu} eq 'intel') {
+ my $iommu_aw_bits_default = min_version($machine_version, 9, 2) ? 48 : 39;
+ my $iommu_aw_bits = $machine_conf->{'aw-bits'} // $iommu_aw_bits_default;
+
+ warn "guest address width exceeds vIOMMU address width: $cpu_aw_bits > $iommu_aw_bits\n"
+ if $cpu_aw_bits && $iommu_aw_bits && $cpu_aw_bits > $iommu_aw_bits;
+ }
+}
+
sub machine_type_is_q35 {
my ($conf) = @_;
diff --git a/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
new file mode 100644
index 00000000..d6cff715
--- /dev/null
+++ b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf
@@ -0,0 +1,4 @@
+# TEST: Check if exceeding guest-phys-bits > iommu aw-bits is correctly warned about
+# EXPECT_WARN: guest address width exceeds vIOMMU address width: 46 > 39
+cpu: host,guest-phys-bits=46
+machine: q35,viommu=intel,aw-bits=39
diff --git a/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd
new file mode 100644
index 00000000..0ec488ae
--- /dev/null
+++ b/src/test/cfg2cmd/q35-viommu-intel-exceeding-aw-bits.conf.cmd
@@ -0,0 +1,25 @@
+/usr/bin/kvm \
+ -id 8006 \
+ -name 'vm8006,debug-threads=on' \
+ -no-shutdown \
+ -chardev 'socket,id=qmp,path=/var/run/qemu-server/8006.qmp,server=on,wait=off' \
+ -mon 'chardev=qmp,mode=control' \
+ -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect-ms=5000' \
+ -mon 'chardev=qmp-event,mode=control' \
+ -pidfile /var/run/qemu-server/8006.pid \
+ -daemonize \
+ -smp '1,sockets=1,cores=1,maxcpus=1' \
+ -nodefaults \
+ -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
+ -vnc 'unix:/var/run/qemu-server/8006.vnc,password=on' \
+ -cpu 'host,+kvm_pv_eoi,+kvm_pv_unhalt,guest-phys-bits=46' \
+ -m 512 \
+ -global 'ICH9-LPC.disable_s3=1' \
+ -global 'ICH9-LPC.disable_s4=1' \
+ -device 'intel-iommu,intremap=on,caching-mode=on,aw-bits=39' \
+ -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
+ -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
+ -device 'VGA,id=vga,bus=pcie.0,addr=0x1' \
+ -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
+ -iscsi 'initiator-name=iqn.1993-08.org.debian:01:aabbccddeeff' \
+ -machine 'type=q35+pve0,kernel-irqchip=split'
--
2.47.2
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Thread overview: 18+ messages
2025-09-02 11:21 [pve-devel] [PATCH common/qemu-server v2 0/5] fix issues with viommu+vfio passthrough in #6608, #6378 Daniel Kral
2025-09-02 11:21 ` [pve-devel] [PATCH common v2 1/1] procfs: cpuinfo: expose x86_phys_bits and x86_virt_bits values Daniel Kral
2025-09-05 9:10 ` Fiona Ebner
2025-09-05 11:47 ` Daniel Kral
2025-09-02 11:21 ` [pve-devel] [PATCH qemu-server v2 1/4] fix #6608: expose viommu driver aw-bits option Daniel Kral
2025-09-05 10:07 ` Fiona Ebner
2025-09-05 11:45 ` Daniel Kral
2025-09-05 12:00 ` Fiona Ebner
2025-09-05 14:18 ` Daniel Kral
2025-09-02 11:21 ` [pve-devel] [PATCH qemu-server v2 2/4] cpu config: factor out gathering common cpu properties Daniel Kral
2025-09-05 10:32 ` Fiona Ebner
2025-09-02 11:22 ` Daniel Kral [this message]
2025-09-02 11:26 ` [pve-devel] [RFC qemu-server v2 3/4] fix #6378 (continued): warn intel-iommu users about iommu and host aw bits mismatch Daniel Kral
2025-09-05 10:50 ` Fiona Ebner
2025-09-05 11:38 ` Daniel Kral
2025-09-05 12:52 ` Fiona Ebner
2025-09-02 11:22 ` [pve-devel] [RFC qemu-server v2 4/4] machine: warn intel-iommu users about too large address width Daniel Kral
2025-09-05 10:55 ` Fiona Ebner