From: Dominik Csapak <d.csapak@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [RFC PATCH qemu-server] fix #7282: allow (NUMA aware) vCPU pinning
Date: Tue, 17 Feb 2026 12:01:24 +0100
Message-ID: <20260217114813.2063770-1-d.csapak@proxmox.com>

Introduce a new 'pinning' property, which (for now) has two methods for vCPU
pinning:

* numa: pins the vCPUs of each virtual NUMA node to a corresponding host
  NUMA node
* one-to-one: pins each vCPU to a corresponding host CPU, while keeping
  virtual NUMA nodes together on host NUMA nodes

Both options respect the 'affinity' setting and the various 'numaX' settings
as well as possible. Some configurations are not supported, but these don't
make sense from a performance standpoint (where pinning would be necessary)
anyway. The unsupported configurations are: assigning more than one host NUMA
node to a single virtual NUMA node, and defining more guest NUMA nodes than
are available on the host without explicitly mapping them to host NUMA nodes.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
Sending as RFC for now, because I'm still working on implementing some tests
and doing some methodical testing.

Preliminary tests with 'pcm-numa' show that pinning (both 'numa' and
'one-to-one', names pending) greatly reduces cross NUMA node RAM accesses.
I tested with 'cpubench1a': a VM with 1 socket and 8 cores and a default
config (no NUMA, no affinity, no pinning) gets ~50M local and ~50M remote
DRAM accesses per second, but with pinning I see ~130M local DRAM accesses
and <1M remote ones.
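For reference, a minimal config that exercises the new property could look
roughly like the following sketch (memory size and host node choice are
illustrative assumptions, not necessarily the exact benchmark setup):

    sockets: 1
    cores: 8
    memory: 8192
    numa: 1
    numa0: cpus=0-7,memory=8192,hostnodes=0,policy=bind
    pinning: numa

With 'one-to-one' instead of 'numa', each vCPU thread additionally gets
pinned to a single host CPU inside the selected host node. The applied
pinning can be double-checked on the host with 'taskset -cp <thread-id>'
for the thread IDs printed by pin_threads_to_cpus.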
The weird thing is that the more specifically I pin, the lower the cpubench1a
score gets:

  no settings:                                     ~1600
  with numa pinning:                               ~1550
  with numa pinning without hyperthreads:          ~1450
  pinned to specific cores (without hyperthreads): ~1450

So I'm still investigating whether it's the benchmark or whether my
assumptions about performance here are wrong. I'll test another benchmark
too, and in my next version I'll include the results in the commit message.

 src/PVE/QemuServer.pm         |  12 ++
 src/PVE/QemuServer/Makefile   |   1 +
 src/PVE/QemuServer/Pinning.pm | 388 ++++++++++++++++++++++++++++++++++
 3 files changed, 401 insertions(+)
 create mode 100644 src/PVE/QemuServer/Pinning.pm

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 5d2dbe03..f055f33f 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -85,6 +85,7 @@ use PVE::QemuServer::Memory qw(get_current_memory);
 use PVE::QemuServer::MetaInfo;
 use PVE::QemuServer::Monitor qw(mon_cmd);
 use PVE::QemuServer::Network;
+use PVE::QemuServer::Pinning;
 use PVE::QemuServer::OVMF;
 use PVE::QemuServer::PCI qw(print_pci_addr print_pcie_addr print_pcie_root_port parse_hostpci);
 use PVE::QemuServer::QemuImage;
@@ -735,6 +736,12 @@ EODESCR
         optional => 1,
         default => 1,
     },
+    pinning => {
+        type => 'string',
+        format => $PVE::QemuServer::Pinning::pinning_fmt,
+        description => "Set pinning options for the guest",
+        optional => 1,
+    },
 };
 
 my $cicustom_fmt = {
@@ -3174,6 +3181,8 @@ sub config_to_command {
         push @$cmd, '/usr/bin/taskset', '--cpu-list', '--all-tasks', $conf->{affinity};
     }
 
+    PVE::QemuServer::Pinning::assert_pinning_constraints($conf);
+
     push @$cmd, $kvm_binary;
 
     push @$cmd, '-id', $vmid;
@@ -5816,6 +5825,9 @@ sub vm_start_nolock {
 
     syslog("info", "VM $vmid started with PID $pid.");
 
+    eval { PVE::QemuServer::Pinning::pin_threads_to_cpus($conf, $vmid) };
+    log_warn("could not pin vCPU threads - $@") if $@;
+
     if (defined(my $migrate = $res->{migrate})) {
         if ($migrate->{proto} eq 'tcp') {
             my $nodename = nodename();
diff --git a/src/PVE/QemuServer/Makefile b/src/PVE/QemuServer/Makefile
index d599ca91..9edcbf6b 100644
--- a/src/PVE/QemuServer/Makefile
+++ b/src/PVE/QemuServer/Makefile
@@ -21,6 +21,7 @@ SOURCES=Agent.pm \
 	Network.pm \
 	OVMF.pm \
 	PCI.pm \
+	Pinning.pm \
 	QemuImage.pm \
 	QMPHelpers.pm \
 	QSD.pm \
diff --git a/src/PVE/QemuServer/Pinning.pm b/src/PVE/QemuServer/Pinning.pm
new file mode 100644
index 00000000..c3bc870f
--- /dev/null
+++ b/src/PVE/QemuServer/Pinning.pm
@@ -0,0 +1,388 @@
+package PVE::QemuServer::Pinning;
+
+use v5.36;
+
+use PVE::CpuSet;
+use PVE::QemuServer::Memory;
+use PVE::QemuServer::Monitor;
+
+use PVE::Tools qw(dir_glob_foreach run_command);
+
+=head1 NAME
+
+C<PVE::QemuServer::Pinning> - Functions for pinning vCPUs of QEMU guests
+
+=head1 DESCRIPTION
+
+This module contains functions that help with pinning QEMU guest vCPUs to
+host CPUs, considering other parts of the config like NUMA nodes and affinity.
+
+Before the guest is started, C<assert_pinning_constraints> should be called to
+check the pinning constraints.
+
+After the guest is started, C<pin_threads_to_cpus> is called to actually pin
+the vCPU threads to the host CPUs according to the config.
+
+=cut
+
+my $DEFAULT_VCPU_PINNING = "none";
+
+our $pinning_fmt = {
+    'vcpus' => {
+        type => 'string',
+        enum => ['one-to-one', 'numa', $DEFAULT_VCPU_PINNING],
+        description => "Set the type of vCPU pinning.",
+        verbose_description => <<EODESCR,
+Set the vCPU pinning mode. With 'numa', the vCPUs of each virtual NUMA node
+are pinned to a corresponding host NUMA node. With 'one-to-one', each vCPU is
+additionally pinned to a single host CPU.
+EODESCR
+        default => 'none',
+        default_key => 1,
+    },
+};
+
+=head2 Restrictions and Constraints
+
+To keep it simple, pinning is limited to a subset of possible NUMA
+configurations.
+For example, with 'numaX' configs it's possible to create overlapping virtual
+NUMA nodes, e.g. assigning vCPUs 0-3 to host NUMA nodes 0-1 and vCPUs 4-7 to
+host NUMA nodes 1-2. To keep the pinning logic simple, such configurations are
+not supported. Instead, the user should assign each virtual NUMA node to only
+a single host NUMA node.
+
+=cut
+
+=head2 host_numa_node_cpu_list
+
+Returns the total host CPU count and a map of host NUMA nodes to CPUs.
+
+=cut
+
+my sub host_numa_node_cpu_list {
+    my $base_path = "/sys/devices/system/node/";
+
+    my $map = {};
+    my $count = 0;
+
+    dir_glob_foreach(
+        $base_path,
+        'node(\d+)',
+        sub {
+            my ($fullnode, $nodeid) = @_;
+            opendir(my $dirfd, "$base_path/$fullnode")
+                || die "cannot open numa node dir $fullnode\n";
+            # this is a hash so we can use the key randomness for selecting
+            my $cpus = {};
+            for my $cpu (readdir($dirfd)) {
+                if ($cpu =~ m/^cpu(\d+)$/) {
+                    $cpus->{$1} = 1;
+                    $count++;
+                }
+            }
+            closedir($dirfd);
+
+            $map->{$nodeid} = $cpus;
+        },
+    );
+
+    return ($count, $map);
+}
+
+=head2 limit_by_affinity
+
+Limits the given map of host NUMA nodes to CPUs to the CPUs contained in the
+given affinity map.
+
+=cut
+
+my sub limit_by_affinity($host_cpus, $affinity_members) {
+    return $host_cpus if !$affinity_members || scalar($affinity_members->%*) == 0;
+
+    for my $node_id (keys $host_cpus->%*) {
+        my $node = $host_cpus->{$node_id};
+        for my $cpu_id (keys $node->%*) {
+            delete $node->{$cpu_id} if !defined($affinity_members->{$cpu_id});
+        }
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vnuma_vcpu_map
+
+Returns a hash from virtual NUMA nodes to their vCPUs and an optional host
+NUMA node.
+
+=cut
+
+my sub get_vnuma_vcpu_map($conf) {
+    my $map = {};
+    my $sockets = $conf->{sockets} // 1;
+    my $cores = $conf->{cores} // 1;
+    my $vcpu_count = 0;
+
+    if ($conf->{numa}) {
+        for (my $i = 0; $i < $PVE::QemuServer::Memory::MAX_NUMA; $i++) {
+            my $entry = $conf->{"numa$i"} or next;
+            my $numa = PVE::QemuServer::Memory::parse_numa($entry) or next;
+
+            $map->{$i} = { vcpus => {} };
+            for my $cpurange ($numa->{cpus}->@*) {
+                my ($start, $end) = $cpurange->@*;
+                for my $cpu (($start .. ($end // $start))) {
+                    $map->{$i}->{vcpus}->{$cpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+
+            if (my $hostnodes = $numa->{hostnodes}) {
+                die "Pinning only available for 1-to-1 NUMA node mapping\n"
+                    if (scalar($hostnodes->@*) > 1 || defined($hostnodes->[0]->[1]));
+                $map->{$i}->{hostnode} = $hostnodes->[0]->[0];
+            }
+        }
+
+        my $vcpu_maps = scalar(keys $map->%*);
+        if ($vcpu_maps == 0) {
+            for my $socket ((0 .. ($sockets - 1))) {
+                $map->{$socket} = { vcpus => {} };
+                for my $cpu ((0 .. ($cores - 1))) {
+                    my $vcpu = $socket * $cores + $cpu;
+                    $map->{$socket}->{vcpus}->{$vcpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+        }
+
+        die "Invalid NUMA configuration for pinning, some vCPUs missing in numa binding\n"
+            if $vcpu_count != $sockets * $cores;
+    } else {
+        # numa not enabled, so all vCPUs are on the same node
+        $map->{0} = { vcpus => {} };
+        for my $i ((0 .. ($sockets * $cores - 1))) {
+            $map->{0}->{vcpus}->{$i} = 1;
+        }
+    }
+
+    return $map;
+}
+
+=head2 get_filtered_host_cpus
+
+Returns the map of host NUMA nodes to CPUs after applying the optional
+affinity from the config.
+
+=cut
+
+sub get_filtered_host_cpus($conf) {
+    my ($host_cpu_count, $host_cpus) = host_numa_node_cpu_list();
+
+    if (my $affinity = $conf->{affinity}) {
+        my ($_affinity_count, $affinity_members) = PVE::CpuSet::parse_cpuset($affinity);
+        $host_cpus = limit_by_affinity($host_cpus, $affinity_members);
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vcpu_to_host_numa_map
+
+Returns a map from vCPUs to host NUMA nodes and also checks the currently
+supported size constraints.
+
+=cut
+
+sub get_vcpu_to_host_numa_map($conf, $host_cpus) {
+    my $numa_vcpu_map = get_vnuma_vcpu_map($conf);
+
+    my $map = {};
+
+    if ($conf->{numa}) {
+        my $used_host_nodes = {};
+
+        # fill the map of requested host numa nodes with their vcpu count
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (defined($hostnode)) {
+                my $vcpu_count = scalar(keys $vcpus->%*);
+
+                $used_host_nodes->{$hostnode} //= 0;
+                $used_host_nodes->{$hostnode} += $vcpu_count;
+            }
+        }
+
+        # check if there are enough host cpus for the vcpus
+        for my $hostnode (keys $used_host_nodes->%*) {
+            my $vcpu_count = $used_host_nodes->{$hostnode};
+            my $host_cpu_count = scalar(keys $host_cpus->{$hostnode}->%*);
+
+            die "Not enough CPUs available on NUMA node $hostnode\n"
+                if $host_cpu_count < $vcpu_count;
+        }
+
+        # try to fit the remaining virtual numa nodes onto real ones
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (!defined($hostnode)) {
+                # try real nodes in ascending order until we find one that fits
+                # NOTE: this is not the optimal solution, as that would probably be NP hard
+                # (it's similar to the bin packing problem)
+                for my $node (sort keys $host_cpus->%*) {
+                    next if $used_host_nodes->{$node};
+                    my $host_cpu_count = scalar(keys $host_cpus->{$node}->%*);
+                    my $vcpu_count = scalar(keys $numa_vcpu_map->{$numa_node}->{vcpus}->%*);
+                    next if $host_cpu_count < $vcpu_count;
+
+                    $hostnode = $node;
+                    $used_host_nodes->{$hostnode} //= 0;
+                    $used_host_nodes->{$hostnode} += $vcpu_count;
+                    last;
+                }
+
+                die "Could not find a fitting host NUMA node for guest NUMA node $numa_node\n"
+                    if !defined($hostnode);
+                $numa_vcpu_map->{$numa_node}->{hostnode} = $hostnode;
+            }
+        }
+
+        # now every virtual numa node has a fitting host numa node and we can
+        # map from vcpu -> host numa node
+        for my $numa_node (keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+
+            for my $vcpu (keys $vcpus->%*) {
+                $map->{$vcpu} = $hostnode;
+            }
+        }
+    } else {
+        my $vcpus = ($conf->{sockets} // 1) * ($conf->{cores} // 1);
+        my $host_cpu_count = 0;
+        for my $node (keys $host_cpus->%*) {
+            $host_cpu_count += scalar(keys $host_cpus->{$node}->%*);
+        }
+        die "not enough available CPUs (limited by affinity) to pin\n" if $vcpus > $host_cpu_count;
+        # returning an empty map means the code can choose any cpu from the available ones
+    }
+    return $map;
+}
+
+=head2 choose_single_cpu
+
+Selects a CPU from the available C<$host_cpus> with the help of the given index
+C<$vcpu> and the vCPU to NUMA node map C<$vcpu_map>.
+
+It modifies the C<$host_cpus> hash to reserve the chosen CPU.
+
+=cut
+
+sub choose_single_cpu($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # choose a numa node at random
+        $hostnode = (keys $host_cpus->%*)[0];
+    }
+
+    # choose a cpu at random
+    my $real_cpu = (keys $host_cpus->{$hostnode}->%*)[0];
+    delete $host_cpus->{$hostnode}->{$real_cpu};
+    if (scalar($host_cpus->{$hostnode}->%*) == 0) {
+        delete $host_cpus->{$hostnode};
+    }
+
+    return $real_cpu;
+}
+
+=head2 get_numa_cpulist
+
+Returns the list of usable CPUs for the given C<$vcpu> index with the help of
+the vCPU to NUMA node map C<$vcpu_map> and the available C<$host_cpus>, as a
+string usable by the 'taskset' command.
+
+=cut
+
+sub get_numa_cpulist($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # if there is no explicit mapping, simply don't pin at all
+        return undef;
+    }
+
+    return join(',', keys $host_cpus->{$hostnode}->%*);
+}
+
+=head2 assert_pinning_constraints
+
+Verifies the pinning constraints by trying to construct the pinning
+configuration. Useful to check the config before actually starting the guest.
+
+=cut
+
+sub assert_pinning_constraints($conf) {
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        get_vcpu_to_host_numa_map($conf, $host_cpus);
+    }
+}
+
+=head2 pin_threads_to_cpus
+
+Pins the vCPU threads of a running guest to host CPUs according to the
+pinning, affinity and NUMA configuration.
+
+Needs the guest to be running, since it queries QMP for the vCPU thread list.
+
+=cut
+
+sub pin_threads_to_cpus($conf, $vmid) {
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        my $vcpu_map = get_vcpu_to_host_numa_map($conf, $host_cpus);
+
+        my $cpuinfo = PVE::QemuServer::Monitor::mon_cmd($vmid, 'query-cpus-fast');
+        for my $vcpu ($cpuinfo->@*) {
+            my $vcpu_index = $vcpu->{'cpu-index'};
+
+            my $cpus;
+            if ($pinning eq 'one-to-one') {
+                $cpus = choose_single_cpu($vcpu_map, $host_cpus, $vcpu_index);
+            } elsif ($pinning eq 'numa') {
+                $cpus = get_numa_cpulist($vcpu_map, $host_cpus, $vcpu_index);
+            }
+
+            die "no cpus selected for pinning vcpu $vcpu_index\n"
+                if !defined($cpus);
+
+            my $tid = $vcpu->{'thread-id'};
+            print "pinning vcpu $vcpu_index (thread $tid) to cpu(s) $cpus\n";
+            run_command(
+                ['taskset', '-c', '-p', $cpus, $tid],
+                logfunc => sub { },
+            );
+        }
+    }
+}
+
+1;
-- 
2.47.3