public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [RFC PATCH qemu-server] fix #7282: allow (NUMA aware) vCPU pinning
@ 2026-02-17 11:01 Dominik Csapak
  2026-03-12 10:29 ` Dominik Csapak
  0 siblings, 1 reply; 2+ messages in thread
From: Dominik Csapak @ 2026-02-17 11:01 UTC (permalink / raw)
  To: pve-devel

Introduce a new 'pinning' property, which (for now) has two methods for
vCPU pinning:

* numa: pins the vCPUs of each virtual NUMA node to a corresponding host
  NUMA node

* one-to-one: pins each vCPU to a corresponding host CPU, while keeping
  virtual NUMA nodes together on host NUMA nodes

Both options respect the 'affinity' setting and the various 'numaX'
settings as well as possible. Some possible configurations are not
supported, but these don't make sense from a performance standpoint
(where pinning would be necessary) anyway.

The unsupported configurations are: assigning more than one host NUMA
node to a single virtual NUMA node, and defining more NUMA nodes in the
guest than are available on the host without explicitly mapping them to
host NUMA nodes.
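
The node-fitting step can be sketched roughly like this (a Python sketch
with made-up names, not the actual Perl implementation; as noted in the
code, it is a simple first-fit, not an optimal bin-packing solution):

```python
def fit_vnuma_to_host(vnuma, host_cpus):
    """First-fit assignment of virtual NUMA nodes to host NUMA nodes.

    vnuma:     {vnode_id: {'vcpus': set of vCPU ids,
                           'hostnode': optional explicit host node id}}
    host_cpus: {hostnode_id: set of host CPU ids}
    Returns {vnode_id: hostnode_id}; raises if a node cannot be placed.
    """
    used = set()
    mapping = {}
    # explicit 'numaX' hostnode bindings win, but must fit
    for vnode in sorted(vnuma):
        host = vnuma[vnode].get('hostnode')
        if host is not None:
            if len(vnuma[vnode]['vcpus']) > len(host_cpus[host]):
                raise ValueError(f"not enough CPUs on host node {host}")
            mapping[vnode] = host
            used.add(host)
    # fit the remaining virtual nodes onto free host nodes in ascending order
    for vnode in sorted(vnuma):
        if vnode in mapping:
            continue
        for host in sorted(host_cpus):
            if host in used:
                continue
            if len(vnuma[vnode]['vcpus']) <= len(host_cpus[host]):
                mapping[vnode] = host
                used.add(host)
                break
        else:
            raise ValueError(f"no host NUMA node fits guest node {vnode}")
    return mapping
```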

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---

sending as RFC for now, because I'm still working on implementing some
tests and doing some methodical testing.

Preliminary tests with 'pcm-numa' show that both pinning modes ('numa'
and 'one-to-one', names pending) greatly reduce cross-NUMA-node RAM
access.

I tested with 'cpubench1a': a VM with 1 socket and 8 cores in the
default config (no numa, no affinity, no pinning) gets ~50M local and
~50M remote DRAM accesses per second,

but with pinning I can see ~130M local DRAM accesses and <1M remote
ones.

The weird thing is that the more specifically I pin, the lower the
benchmark score gets:

no settings: ~1600
with numa pinning: ~1550
with numa pinning without hyperthreads: ~1450
pin to specific cores (without hyperthreads): ~1450

so I'm still investigating whether it's the benchmark or whether my
assumptions about performance here are wrong.

I'll test another benchmark too, and in my next version I'll include
the results in the commit message.

 src/PVE/QemuServer.pm         |  12 ++
 src/PVE/QemuServer/Makefile   |   1 +
 src/PVE/QemuServer/Pinning.pm | 388 ++++++++++++++++++++++++++++++++++
 3 files changed, 401 insertions(+)
 create mode 100644 src/PVE/QemuServer/Pinning.pm

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 5d2dbe03..f055f33f 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -85,6 +85,7 @@ use PVE::QemuServer::Memory qw(get_current_memory);
 use PVE::QemuServer::MetaInfo;
 use PVE::QemuServer::Monitor qw(mon_cmd);
 use PVE::QemuServer::Network;
+use PVE::QemuServer::Pinning;
 use PVE::QemuServer::OVMF;
 use PVE::QemuServer::PCI qw(print_pci_addr print_pcie_addr print_pcie_root_port parse_hostpci);
 use PVE::QemuServer::QemuImage;
@@ -735,6 +736,12 @@ EODESCR
         optional => 1,
         default => 1,
     },
+    pinning => {
+        type => 'string',
+        format => $PVE::QemuServer::Pinning::pinning_fmt,
+        description => "Set pinning options for the guest",
+        optional => 1,
+    },
 };
 
 my $cicustom_fmt = {
@@ -3174,6 +3181,8 @@ sub config_to_command {
         push @$cmd, '/usr/bin/taskset', '--cpu-list', '--all-tasks', $conf->{affinity};
     }
 
+    PVE::QemuServer::Pinning::assert_pinning_constraints($conf);
+
     push @$cmd, $kvm_binary;
 
     push @$cmd, '-id', $vmid;
@@ -5816,6 +5825,9 @@ sub vm_start_nolock {
 
     syslog("info", "VM $vmid started with PID $pid.");
 
+    eval { PVE::QemuServer::Pinning::pin_threads_to_cpus($conf, $vmid) };
+    log_warn("could not pin vCPU threads - $@") if $@;
+
     if (defined(my $migrate = $res->{migrate})) {
         if ($migrate->{proto} eq 'tcp') {
             my $nodename = nodename();
diff --git a/src/PVE/QemuServer/Makefile b/src/PVE/QemuServer/Makefile
index d599ca91..9edcbf6b 100644
--- a/src/PVE/QemuServer/Makefile
+++ b/src/PVE/QemuServer/Makefile
@@ -21,6 +21,7 @@ SOURCES=Agent.pm	\
 	Network.pm	\
 	OVMF.pm		\
 	PCI.pm		\
+	Pinning.pm	\
 	QemuImage.pm	\
 	QMPHelpers.pm	\
 	QSD.pm		\
diff --git a/src/PVE/QemuServer/Pinning.pm b/src/PVE/QemuServer/Pinning.pm
new file mode 100644
index 00000000..c3bc870f
--- /dev/null
+++ b/src/PVE/QemuServer/Pinning.pm
@@ -0,0 +1,388 @@
+package PVE::QemuServer::Pinning;
+
+use v5.36;
+
+use PVE::QemuServer::Memory;
+use PVE::QemuServer::Monitor;
+
+use PVE::Tools qw(dir_glob_foreach run_command);
+
+=head1 NAME
+
+C<PVE::QemuServer::Pinning> - Functions for pinning vCPUs of QEMU guests
+
+=head1 DESCRIPTION
+
+This module contains helper functions for pinning QEMU guest vCPUs to
+host CPUs, considering other parts of the config like NUMA nodes and affinity.
+
+Before the guest is started, C<assert_pinning_constraints> should be called to
+check the pinning constraints.
+
+After the guest is started, C<pin_threads_to_cpus> is called to actually pin
+the vCPU threads to the host CPUs according to the config.
+
+=cut
+
+my $DEFAULT_VCPU_PINNING = "none";
+
+our $pinning_fmt = {
+    'vcpus' => {
+        type => 'string',
+        enum => ['one-to-one', 'numa', $DEFAULT_VCPU_PINNING],
+        description => "Set the vCPU pinning mode.",
+        verbose_description => <<EODESCR,
+There are multiple ways to pin vcpus to host cores:
+
+* $DEFAULT_VCPU_PINNING (default): either no pinning at all, or when 'affinity'
+  is set, pins all vcpus to the range of given cores by 'affinity'.
+
+* numa: pins the set of vcpus of each virtual NUMA node to a corresponding
+  host numa node. This makes memory access consistent for each vcpu, since it
+  won't be rescheduled to a different NUMA node. Takes into account the 'numaX'
+  settings when binding vcpus to host nodes. If 'affinity' is set, only those
+  cores are considered.
+
+* one-to-one: tries to pin each vcpu to a specific host core. This prevents
+  vcpus from being rescheduled to other host cores entirely and can thus reach
+  the best performance, but it is the least flexible for the host scheduler.
+  Takes into account the virtual and physical NUMA layout.
+
+This is only supported when there is at least one host NUMA node per virtual
+one, and each manual 'numaX' node is assigned to at most one host NUMA node.
+
+None of these options take into account pinning settings of other virtual
+machines or containers, so to achieve the best and most consistent
+performance, use a combination of the 'numaX' and 'affinity' options to make
+sure host cores are not crowded with vcpu assignments.
+
+EODESCR
+        default => 'none',
+        default_key => 1,
+    },
+};
+
+=head2 Restrictions and Constraints
+
+To keep it simple, pinning is limited to a subset of possible NUMA
+configurations. For example, with 'numaX' configs, it's possible to create
+overlapping virtual NUMA nodes, e.g. assigning vCPUs 0-3 to host NUMA nodes 0-1
+and vCPUs 4-7 to host NUMA nodes 1-2. To keep the pinning logic simple, such
+configurations are not supported. Instead the user should assign virtual NUMA
+nodes only to a single host NUMA node.
+
+=cut
+
+=head2 host_numa_node_cpu_list
+
+Returns the total CPU count and a map of host numa nodes to cpus.
+
+=cut
+
+my sub host_numa_node_cpu_list {
+    my $base_path = "/sys/devices/system/node/";
+
+    my $map = {};
+    my $count = 0;
+
+    dir_glob_foreach(
+        $base_path,
+        'node(\d+)',
+        sub {
+            my ($fullnode, $nodeid) = @_;
+            opendir(my $dirfd, "$base_path/$fullnode")
+                || die "cannot open numa node dir $fullnode\n";
+            # this is a hash so we can use the key randomness for selecting
+            my $cpus = {};
+            for my $cpu (readdir($dirfd)) {
+                if ($cpu =~ m/^cpu(\d+)$/) {
+                    $cpus->{$1} = 1;
+                    $count++;
+                }
+            }
+            closedir($dirfd);
+
+            $map->{$nodeid} = $cpus;
+        },
+    );
+
+    return ($count, $map);
+}
+
+=head2 limit_by_affinity
+
+Limits a map of numa nodes to cpus by the given affinity map.
+
+=cut
+
+my sub limit_by_affinity($host_cpus, $affinity_members) {
+    return $host_cpus if !$affinity_members || scalar($affinity_members->%*) == 0;
+
+    for my $node_id (keys $host_cpus->%*) {
+        my $node = $host_cpus->{$node_id};
+        for my $cpu_id (keys $node->%*) {
+            delete $node->{$cpu_id} if !defined($affinity_members->{$cpu_id});
+        }
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vnuma_vcpu_map
+
+Returns a hash from virtual NUMA nodes to vCPUs and an optional host NUMA node
+
+=cut
+
+my sub get_vnuma_vcpu_map($conf) {
+    my $map = {};
+    my $sockets = $conf->{sockets} // 1;
+    my $cores = $conf->{cores} // 1;
+    my $vcpu_count = 0;
+
+    if ($conf->{numa}) {
+        for (my $i = 0; $i < $PVE::QemuServer::Memory::MAX_NUMA; $i++) {
+            my $entry = $conf->{"numa$i"} or next;
+            my $numa = PVE::QemuServer::Memory::parse_numa($entry) or next;
+
+            $map->{$i} = { vcpus => {} };
+            for my $cpurange ($numa->{cpus}->@*) {
+                my ($start, $end) = $cpurange->@*;
+                for my $cpu (($start .. ($end // $start))) {
+                    $map->{$i}->{vcpus}->{$cpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+
+            if (my $hostnodes = $numa->{hostnodes}) {
+                die "Pinning only available for 1-to-1 NUMA node mapping\n"
+                    if (scalar($hostnodes->@*) > 1 || defined($hostnodes->[0]->[1]));
+                $map->{$i}->{hostnode} = $hostnodes->[0]->[0];
+            }
+        }
+
+        my $vcpu_maps = scalar(keys $map->%*);
+        if ($vcpu_maps == 0) {
+            for my $socket ((0 .. ($sockets - 1))) {
+                $map->{$socket} = { vcpus => {} };
+                for my $cpu ((0 .. ($cores - 1))) {
+                    my $vcpu = $socket * $cores + $cpu;
+                    $map->{$socket}->{vcpus}->{$vcpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+        }
+
+        die "Invalid NUMA configuration for pinning, some vCPUs missing in numa binding\n"
+            if $vcpu_count != $sockets * $cores;
+    } else {
+        # numa not enabled so all are on the same node
+        $map->{0} = { vcpus => {} };
+        for my $i ((0 .. ($sockets * $cores - 1))) {
+            $map->{0}->{vcpus}->{$i} = 1;
+        }
+    }
+
+    return $map;
+}
+
+=head2 get_filtered_host_cpus
+
+Returns the map of host numa nodes to CPUs after applying the optional affinity from the config.
+
+=cut
+
+sub get_filtered_host_cpus($conf) {
+    my ($host_cpu_count, $host_cpus) = host_numa_node_cpu_list();
+
+    if (my $affinity = $conf->{affinity}) {
+        my ($_affinity_count, $affinity_members) = PVE::CpuSet::parse_cpuset($affinity);
+        $host_cpus = limit_by_affinity($host_cpus, $affinity_members);
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vcpu_to_host_numa_map
+
+Returns a map from vcpus to host numa nodes, and also checks the currently supported size constraints.
+
+=cut
+
+sub get_vcpu_to_host_numa_map($conf, $host_cpus) {
+    my $numa_vcpu_map = get_vnuma_vcpu_map($conf);
+
+    my $map = {};
+
+    if ($conf->{numa}) {
+        my $used_host_nodes = {};
+
+        # fill the map of requested host numa nodes with their vcpu count
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (defined($hostnode)) {
+                my $vcpu_count = scalar(keys $vcpus->%*);
+
+                $used_host_nodes->{$hostnode} //= 0;
+                $used_host_nodes->{$hostnode} += $vcpu_count;
+            }
+        }
+
+        # check if there are enough host cpus for the vcpus
+        for my $hostnode (keys $used_host_nodes->%*) {
+            my $vcpu_count = $used_host_nodes->{$hostnode};
+            my $host_cpu_count = scalar(keys $host_cpus->{$hostnode}->%*);
+
+            die "Not enough CPUs available on NUMA node $hostnode\n"
+                if $host_cpu_count < $vcpu_count;
+        }
+
+        # try to fit remaining virtual numa nodes to real ones
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (!defined($hostnode)) {
+                # try real nodes in ascending order until we find one that fits
+                # NOTE: this is not the optimal solution as this would probably be NP hard,
+                # as it's similar to the Bin Packing problem
+                for my $node (sort keys $host_cpus->%*) {
+                    next if $used_host_nodes->{$node};
+                    my $host_cpu_count = scalar(keys $host_cpus->{$node}->%*);
+                    my $vcpu_count = scalar(keys $numa_vcpu_map->{$numa_node}->{vcpus}->%*);
+                    next if $host_cpu_count < $vcpu_count;
+
+                    $hostnode = $node;
+                    $used_host_nodes->{$hostnode} //= 0;
+                    $used_host_nodes->{$hostnode} += $vcpu_count;
+                    last;
+                }
+
+                die "Could not find a fitting host NUMA node for guest NUMA node $numa_node\n"
+                    if !defined($hostnode);
+                $numa_vcpu_map->{$numa_node}->{hostnode} = $hostnode;
+            }
+        }
+
+        # now every virtual numa node has a fitting host numa node and we can map from vcpu -> numa node
+        for my $numa_node (keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+
+            for my $vcpu (keys $vcpus->%*) {
+                $map->{$vcpu} = $hostnode;
+            }
+        }
+    } else {
+        my $vcpus = ($conf->{sockets} // 1) * ($conf->{cores} // 1);
+        my $host_cpu_count = 0;
+        for my $node (keys $host_cpus->%*) {
+            $host_cpu_count += scalar(keys $host_cpus->{$node}->%*);
+        }
+        die "not enough available CPUs (limited by affinity) to pin\n" if $vcpus > $host_cpu_count;
+        # returning an empty map means the code can choose any cpu from the available ones
+    }
+    return $map;
+}
+
+=head2 choose_single_cpu
+
+Selects a CPU from the available C<$host_cpus> with the help of the given index
+C<$vcpu> and the vCPU to NUMA node map C<$vcpu_map>.
+
+It modifies the C<$host_cpus> hash to reserve the chosen CPU.
+
+=cut
+
+sub choose_single_cpu($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # choose a numa node at random
+        $hostnode = (keys $host_cpus->%*)[0];
+    }
+
+    # choose one at random
+    my $real_cpu = (keys $host_cpus->{$hostnode}->%*)[0];
+    delete $host_cpus->{$hostnode}->{$real_cpu};
+    if (scalar($host_cpus->{$hostnode}->%*) == 0) {
+        delete $host_cpus->{$hostnode};
+    }
+
+    return $real_cpu;
+}
+
+=head2 get_numa_cpulist
+
+Returns the list of usable CPUs for the given C<$vcpu> index with the help of
+the vCPU to NUMA node map C<$vcpu_map> and the available C<$host_cpus> as a
+string usable by the 'taskset' command.
+
+=cut
+
+sub get_numa_cpulist($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # if there is no explicit mapping, simply don't pin at all
+        return undef;
+    }
+
+    return join(',', keys $host_cpus->{$hostnode}->%*);
+}
+
+=head2 assert_pinning_constraints
+
+Used to verify the constraints from pinning by trying to construct the
+pinning configuration. Useful to check the config before actually starting
+the guest.
+
+=cut
+
+sub assert_pinning_constraints($conf) {
+
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        get_vcpu_to_host_numa_map($conf, $host_cpus);
+    }
+}
+
+=head2 pin_threads_to_cpus
+
+Pins the vCPU threads of a running guest to host CPUs according to the
+pinning, affinity and NUMA configuration.
+
+Needs the guest to be running, since it queries QMP for the vCPU thread list.
+
+=cut
+
+sub pin_threads_to_cpus($conf, $vmid) {
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        my $vcpu_map = get_vcpu_to_host_numa_map($conf, $host_cpus);
+
+        my $cpuinfo = PVE::QemuServer::Monitor::mon_cmd($vmid, 'query-cpus-fast');
+        for my $vcpu ($cpuinfo->@*) {
+
+            my $vcpu_index = $vcpu->{'cpu-index'};
+
+            my $cpus;
+            if ($pinning eq 'one-to-one') {
+                $cpus = choose_single_cpu($vcpu_map, $host_cpus, $vcpu_index);
+            } elsif ($pinning eq 'numa') {
+                $cpus = get_numa_cpulist($vcpu_map, $host_cpus, $vcpu_index);
+            }
+
+            die "no cpus selected for pinning vcpu $vcpu_index\n"
+                if !defined($cpus);
+
+            my $tid = $vcpu->{'thread-id'};
+            print "pinning vcpu $vcpu_index (thread $tid) to cpu(s) $cpus\n";
+            run_command(
+                ['taskset', '-c', '-p', $cpus, $tid], logfunc => sub { },
+            );
+        }
+    }
+}
+
+1;
-- 
2.47.3





^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC PATCH qemu-server] fix #7282: allow (NUMA aware) vCPU pinning
  2026-02-17 11:01 [RFC PATCH qemu-server] fix #7282: allow (NUMA aware) vCPU pinning Dominik Csapak
@ 2026-03-12 10:29 ` Dominik Csapak
  0 siblings, 0 replies; 2+ messages in thread
From: Dominik Csapak @ 2026-03-12 10:29 UTC (permalink / raw)
  To: pve-devel

I tried to run a few benchmarks methodically, but the results are not
really what I expected.

Either my methodology is wrong (wrong benchmarks, etc.) or
the platform where I tested is actually not suited for this kind of testing.

Still writing it down, so maybe someone else can give feedback ;)

The platform was a 1st-generation EPYC 7351P 16-core processor with
32 threads (not the fastest, but what I had on hand). It was configured
with 64 GB of DDR4-2667 memory (4x16GB sticks).

This CPU has 4 NUMA nodes:
numactl -H output:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 16 17 18 19
node 0 size: 15936 MB
node 0 free: 14110 MB
node 1 cpus: 4 5 6 7 20 21 22 23
node 1 size: 16124 MB
node 1 free: 14142 MB
node 2 cpus: 8 9 10 11 24 25 26 27
node 2 size: 16124 MB
node 2 free: 15210 MB
node 3 cpus: 12 13 14 15 28 29 30 31
node 3 size: 16080 MB
node 3 free: 13467 MB
node distances:
node     0    1    2    3
    0:   10   16   16   16
    1:   16   10   16   16
    2:   16   16   10   16
    3:   16   16   16   10

I ran multiple benchmarks:

* cpubench1a, which tries to be a general CPU benchmark that
   covers a wide array of sub-tests and claims to be NUMA aware [0];
   it reports a single-core and a multi-core score.
* sysbench in cpu mode (sysbench cpu run)
* sysbench in default memory mode (sysbench memory run)
* sysbench in memory mode with higher block/total size
   (sysbench memory run --time=60 --memory-total-size=500G
   --memory-block-size=4M)

This was run in a virtual machine (which was the only thing running
on that host at the time) with 8 GB of memory and the following
cpu/numa/pinning options:

* A: 1 sockets, 8 cores, no numa, no pinnng
* B: 4 sockets, 2 cores, no numa, no pinning
* C: 4 sockets, 2 cores, numa=1, no pinning
* D: 4 sockets, 2 cores, numa=1, manual NUMA binding to the correct
   host nodes with numaX settings, no pinning
* E: same as D, but with pinning=numa set
* F: same as D, but with pinning=one-to-one set

The cpubench1a results look as follows (higher is better):

configuration  1core median  1core maximum  mc median   mc maximum
A              103.348458    103.949064     784.815770  785.258851
B              104.730286    105.583773     782.621719  785.373313
C              103.333099    103.933290     783.563512  786.722700
D              103.230861    104.400941     783.705615  785.707556
E              103.150615    103.266333     787.566320  788.102865
F              103.230200    104.069506     783.157062  784.127415

So for the single-core case, it did not make any consistent difference;
any change here is probably due to CPU thread scheduling on the host.

On the multi-core side, there is a tendency for the numa pinning mode
to be slightly better, but the change from A to E is only ~0.3%.

I can't really say why the one-to-one pinning case is a bit lower;
maybe because my code does not check whether the target core is a
'real' core or a hyperthread. I would have to do more testing on this.
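
For reference, the SMT siblings of each CPU can be read on Linux from
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list; a selection
step that prefers one CPU per sibling group could be sketched like this
(a Python sketch with made-up names, not part of the patch):

```python
def prefer_physical_cores(available, siblings):
    """Order candidate CPUs so one CPU per SMT sibling group comes first.

    available: iterable of candidate host CPU ids
    siblings:  {cpu_id: frozenset of its SMT siblings (incl. itself)},
               e.g. parsed from thread_siblings_list in sysfs
    Returns a list: first choices (one per physical core), then the rest.
    """
    seen_groups = set()
    primary, secondary = [], []
    for cpu in sorted(available):
        group = siblings[cpu]
        if group in seen_groups:
            # a sibling of this core was already picked: deprioritize
            secondary.append(cpu)
        else:
            seen_groups.add(group)
            primary.append(cpu)
    return primary + secondary
```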

A similar story with the sysbench CPU latency (numbers in ms):

configuration  average  maximum  95th percentile
A              0.70     1.09     0.72
B              0.70     1.16     0.72
C              0.70     1.75     0.72
D              0.70     1.80     0.72
E              0.70     1.12     0.72
F              0.70     2.37     0.72

I don't think the maximum here has any real meaning, since the 95th
percentile is the same in all tests.

Sysbench memory latency results (also in ms)

configuration  1k avg  1k max  1k 95th  4m avg  4m max  4m 95th
A              0.00    0.55    0.00     0.22    16.74   0.23
B              0.00    0.93    0.00     0.22    11.08   0.24
C              0.00    0.36    0.00     0.22    7.35    0.23
D              0.00    1.08    0.00     0.22    2.07    0.23
E              0.00    0.29    0.00     0.23    2.26    0.24
F              0.00    0.31    0.00     0.22    2.63    0.23

Here, correctly configuring NUMA seemingly reduces the '4M max'
outliers, but the pinning had a slightly detrimental effect, and it
was not enough to impact the 95th percentile in a meaningful way.

Since I now have some automation for these things, I can retry on a
different platform to see if it makes a difference.

If anybody notices something off here, or has some tips on e.g. which
benchmarks to use, please do tell ;)

What I also did not test yet is whether it makes a difference when
passing through PCI devices that are bound to a specific NUMA node,
but I'm working on that and will try to benchmark it too.

0: https://github.com/AmadeusITGroup/cpubench1A

On 2/17/26 12:48 PM, Dominik Csapak wrote:
> Introduce a new 'pinning' property, which (for now) has two methods for
> vCPU pinning:
> 
> * numa: pins the vCPUs of each virtual NUMA node to a corresponding host
>    NUMA node
> 
> * one-to-one: pins each vCPU to a corresponding host CPU, while keeping
>    virtual NUMA nodes together on host NUMA nodes
> 
> both options respect the 'affinity' setting and the various 'numaX'
> setting as good as possible. There are some possible configurations that
> are not supported, but these don't make sense from a performance
> standpoint (where pinning would be necessary) anyway.
> 
> These not supported configurations are when more than one host NUMA node
> is assigned to one virtual NUMA node and when more NUMA nodes in the
> guest are defined than available on the host, and they're not explicitly
> mapped to host NUMA nodes.
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> 
> sending as RFC for now, because I'm still working on implementing some
> tests and doing some methodical testing.
> 
> preliminary tests show that the pinning (both 'numa' and 'one-to-one',
> name pending), show with 'pcm-numa' that the cross numa node ram access
> is greatly reduced.
> 
> I tested with 'cpubench1a' and a vm with 1 socket 8 cores with default
> config  (no numa, no affinity, no pinning) gets ~50M local dram accesses
> and ~50M remote dram accesses per second
> 
> but with pinning i can see ~130M local dram accesses and <1M remote
> ones.
> 
> The weird thing is the more specific I pin, the lower the score from the
> benchmark gets:
> 
> no settings: ~1600
> with numa pinning: ~1550
> with numa pinning without hyperthreads: ~1450
> pin to specific cores (without hyperthreads): ~1450
> 
> so I'm still investigating if it's the benchmark or if my assumptions
> about performance here are wrong.
> 
> I'll test another benchmark too and in my next version i'll include the
> results in the commit message.
> 
>   src/PVE/QemuServer.pm         |  12 ++
>   src/PVE/QemuServer/Makefile   |   1 +
>   src/PVE/QemuServer/Pinning.pm | 388 ++++++++++++++++++++++++++++++++++
>   3 files changed, 401 insertions(+)
>   create mode 100644 src/PVE/QemuServer/Pinning.pm
> 
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index 5d2dbe03..f055f33f 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -85,6 +85,7 @@ use PVE::QemuServer::Memory qw(get_current_memory);
>   use PVE::QemuServer::MetaInfo;
>   use PVE::QemuServer::Monitor qw(mon_cmd);
>   use PVE::QemuServer::Network;
> +use PVE::QemuServer::Pinning;
>   use PVE::QemuServer::OVMF;
>   use PVE::QemuServer::PCI qw(print_pci_addr print_pcie_addr print_pcie_root_port parse_hostpci);
>   use PVE::QemuServer::QemuImage;
> @@ -735,6 +736,12 @@ EODESCR
>           optional => 1,
>           default => 1,
>       },
> +    pinning => {
> +        type => 'string',
> +        format => $PVE::QemuServer::Pinning::pinning_fmt,
> +        description => "Set pinning options for the guest",
> +        optional => 1,
> +    },
>   };
>   
>   my $cicustom_fmt = {
> @@ -3174,6 +3181,8 @@ sub config_to_command {
>           push @$cmd, '/usr/bin/taskset', '--cpu-list', '--all-tasks', $conf->{affinity};
>       }
>   
> +    PVE::QemuServer::Pinning::assert_pinning_constraints($conf);
> +
>       push @$cmd, $kvm_binary;
>   
>       push @$cmd, '-id', $vmid;
> @@ -5816,6 +5825,9 @@ sub vm_start_nolock {
>   
>       syslog("info", "VM $vmid started with PID $pid.");
>   
> +    eval { PVE::QemuServer::Pinning::pin_threads_to_cpus($conf, $vmid) };
> +    log_warn("could not pin vCPU threads - $@") if $@;
> +
>       if (defined(my $migrate = $res->{migrate})) {
>           if ($migrate->{proto} eq 'tcp') {
>               my $nodename = nodename();
> diff --git a/src/PVE/QemuServer/Makefile b/src/PVE/QemuServer/Makefile
> index d599ca91..9edcbf6b 100644
> --- a/src/PVE/QemuServer/Makefile
> +++ b/src/PVE/QemuServer/Makefile
> @@ -21,6 +21,7 @@ SOURCES=Agent.pm	\
>   	Network.pm	\
>   	OVMF.pm		\
>   	PCI.pm		\
> +	Pinning.pm	\
>   	QemuImage.pm	\
>   	QMPHelpers.pm	\
>   	QSD.pm		\
> diff --git a/src/PVE/QemuServer/Pinning.pm b/src/PVE/QemuServer/Pinning.pm
> new file mode 100644
> index 00000000..c3bc870f
> --- /dev/null
> +++ b/src/PVE/QemuServer/Pinning.pm
> @@ -0,0 +1,388 @@
> +package PVE::QemuServer::Pinning;
> +
> +use v5.36;
> +
> +use PVE::QemuServer::Memory;
> +use PVE::QemuServer::Monitor;
> +
> +use PVE::Tools qw(dir_glob_foreach run_command);
> +
> +=head1 NAME
> +
> +C<PVE::QemuServer::Pinning> - Functions for pinning vCPUs of QEMU guests
> +
> +=head1 DESCRIPTION
> +
> +This module contains functions for helping with pinning QEMU guest vCPUS to
> +host CPUs, considering other parts of the config like NUMA nodes and affinity.
> +
> +Before the guest is started C<assert_pinning_constraints> should be called to
> +check the pinning constraints.
> +
> +After the guest is started, C<pin_threads_to_cpus> is called to actually pin
> +the vCPU threads to the host CPUs according to the config.
> +
> +=cut
> +
> +my $DEFAULT_VCPU_PINNING = "none";
> +
> +our $pinning_fmt = {
> +    'vcpus' => {
> +        type => 'string',
> +        enum => [qw(one-to-one numa $DEFAULT_VCPU_PINNING)],
> +        description => "Set the type of vCPU pinning modes.",
> +        verbose_description => <<EODESCR,
> +There are multiple ways to pin vcpus to host cores:
> +
> +* $DEFAULT_VCPU_PINNING (default): either no pinning at all, or when 'affinity'
> +  is set, pins all vcpus to the range of given cores by 'affinity'.
> +
> +* numa: pins the sets of vcpus of each virtual NUMA node, to a corresponding
> +  host numa node. This makes memory access consistent for each vcpu, since it
> +  won't be rescheduled to a different NUMA node. Takes into account the 'numaX'
> +  setting when binding vcpus to host nodes. If 'affinity' is set, only consider
> +  those cores.
> +
> +* one-to-one: tries to pin each vcpu to a specific host core. This prevents
> +  vcpus to be rescheduled on other vcpus entirely and can thus reach the most
> +  performance, it is the least flexible for the host scheduler. Takes into
> +  account the virtual and physical NUMA layout.
> +
> +This is only supported when there is at least one host NUMA node per virtual
> +one, and each manual 'numaX' node is assigned to at most one host NUMA node.
> +
> +Any of these options won't take into account pinning settings from different
> +virtual machines or containers, so to achieve the best and most consistent
> +performance, use the combination of 'numaX' and 'affinity' options to make sure
> +host cores are not crowded with vcpu assignments.
> +
> +EODESCR
> +        default => 'none',
> +        default_key => 1,
> +    },
> +};
> +
> +=head2 Restrictions and Constraints
> +
> +To keep it simple, pinning is limited to a subset of possible NUMA
> +configurations. For example, with 'numaX' configs, it's possible to create
> +overlapping virtual NUMA nodes, e.g. assigning vCPUs 0-3 to host NUMA nodes 0-1
> +and vCPUs 4-7 to host NUMA nodes 1-2. To keep the pinning logic simple, such
> +configurations are not supported. Instead the user should assign virtual NUMA
> +nodes only to a single host NUMA node.
> +
> +=cut
> +
> +=head2 host_numa_node_cpu_list
> +
> +returns a map of host numa nodes to cpus
> +
> +=cut
> +
> +my sub host_numa_node_cpu_list {
> +    my $base_path = "/sys/devices/system/node/";
> +
> +    my $map = {};
> +    my $count = 0;
> +
> +    dir_glob_foreach(
> +        $base_path,
> +        'node(\d+)',
> +        sub {
> +            my ($fullnode, $nodeid) = @_;
> +            opendir(my $dirfd, "$base_path/$fullnode")
> +                || die "cannot open numa node dir $fullnode\n";
> +            # this is a hash so we can use the key randomness for selecting
> +            my $cpus = {};
> +            for my $cpu (readdir($dirfd)) {
> +                if ($cpu =~ m/^cpu(\d+)$/) {
> +                    $cpus->{$1} = 1;
> +                    $count++;
> +                }
> +            }
> +            closedir($dirfd);
> +
> +            $map->{$nodeid} = $cpus;
> +        },
> +    );
> +
> +    return ($count, $map);
> +}
> +
> +=head2 limit_by_affinity
> +
> +limits a list of map of numa nodes to cpu by a the given affinity map
> +
> +=cut
> +
> +my sub limit_by_affinity($host_cpus, $affinity_members) {
> +    return $host_cpus if !$affinity_members || scalar($affinity_members->%*) == 0;
> +
> +    for my $node_id (keys $host_cpus->%*) {
> +        my $node = $host_cpus->{$node_id};
> +        for my $cpu_id (keys $node->%*) {
> +            delete $node->{$cpu_id} if !defined($affinity_members->{$cpu_id});
> +        }
> +    }
> +
> +    return $host_cpus;
> +}
> +
> +=head2 get_vnuma_vcpu_map
> +
> +Returns a hash from virtual NUMA nodes to vCPUs and an optional host NUMA node
> +
> +=cut
> +
> +my sub get_vnuma_vcpu_map($conf) {
> +    my $map = {};
> +    my $sockets = $conf->{sockets} // 1;
> +    my $cores = $conf->{cores} // 1;
> +    my $vcpu_count = 0;
> +
> +    if ($conf->{numa}) {
> +        for (my $i = 0; $i < $PVE::QemuServer::Memory::MAX_NUMA; $i++) {
> +            my $entry = $conf->{"numa$i"} or next;
> +            my $numa = PVE::QemuServer::Memory::parse_numa($entry) or next;
> +
> +            $map->{$i} = { vcpus => {} };
> +            for my $cpurange ($numa->{cpus}->@*) {
> +                my ($start, $end) = $cpurange->@*;
> +                for my $cpu (($start .. ($end // $start))) {
> +                    $map->{$i}->{vcpus}->{$cpu} = 1;
> +                    $vcpu_count++;
> +                }
> +            }
> +
> +            if (my $hostnodes = $numa->{hostnodes}) {
> +                die "Pinning only available for 1-to-1 NUMA node mapping\n"
> +                    if (scalar($hostnodes->@*) > 1 || defined($hostnodes->[0]->[1]));
> +                $map->{$i}->{hostnode} = $hostnodes->[0]->[0];
> +            }
> +        }
> +
> +        my $vcpu_maps = scalar(keys $map->%*);
> +        if ($vcpu_maps == 0) {
> +            for my $socket ((0 .. ($sockets - 1))) {
> +                $map->{$socket} = { vcpus => {} };
> +                for my $cpu ((0 .. ($cores - 1))) {
> +                    my $vcpu = $socket * $cores + $cpu;
> +                    $map->{$socket}->{vcpus}->{$vcpu} = 1;
> +                    $vcpu_count++;
> +                }
> +            }
> +        }
> +
> +        die "Invalid NUMA configuration for pinning, some vCPUs missing in numa binding\n"
> +            if $vcpu_count != $sockets * $cores;
> +    } else {
> +        # numa not enabled so all are on the same node
> +        $map->{0} = { vcpus => {} };
> +        for my $i ((0 .. ($sockets * $cores - 1))) {
> +            $map->{0}->{vcpus}->{$i} = 1;
> +        }
> +    }
> +
> +    return $map;
> +}
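When no numaX entries are present, the fallback above turns each socket into one virtual node and lays out vCPUs as `vcpu = socket * cores + core`. A quick sketch of that arithmetic (the sockets/cores values are just example inputs):

```shell
#!/bin/sh
# Default vNUMA layout used when no numaX entries exist:
# each socket becomes one virtual node holding vcpus
#   socket * cores .. socket * cores + cores - 1
sockets=2
cores=4
socket=0
while [ "$socket" -lt "$sockets" ]; do
    first=$((socket * cores))
    last=$((socket * cores + cores - 1))
    echo "vnode $socket: vcpus $first-$last"
    socket=$((socket + 1))
done
```

For 2 sockets with 4 cores each this yields vnode 0 holding vcpus 0-3 and vnode 1 holding vcpus 4-7.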
> +
> +=head2 get_filtered_host_cpus
> +
> +Returns the map of host numa nodes to CPUs after applying the optional affinity from the config.
> +
> +=cut
> +
> +sub get_filtered_host_cpus($conf) {
> +    my ($host_cpu_count, $host_cpus) = host_numa_node_cpu_list();
> +
> +    if (my $affinity = $conf->{affinity}) {
> +        my ($_affinity_count, $affinity_members) = PVE::CpuSet::parse_cpuset($affinity);
> +        $host_cpus = limit_by_affinity($host_cpus, $affinity_members);
> +    }
> +
> +    return $host_cpus;
> +}
> +
> +=head2 get_vcpu_to_host_numa_map
> +
> +Returns a map from vCPUs to host NUMA nodes, and checks the size constraints that are currently supported.
> +
> +=cut
> +
> +sub get_vcpu_to_host_numa_map($conf, $host_cpus) {
> +    my $numa_vcpu_map = get_vnuma_vcpu_map($conf);
> +
> +    my $map = {};
> +
> +    if ($conf->{numa}) {
> +        my $used_host_nodes = {};
> +
> +        # fill the map of requested host numa nodes with their vcpu count
> +        for my $numa_node (sort keys $numa_vcpu_map->%*) {
> +            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
> +            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
> +            if (defined($hostnode)) {
> +                my $vcpu_count = scalar(keys $vcpus->%*);
> +
> +                $used_host_nodes->{$hostnode} //= 0;
> +                $used_host_nodes->{$hostnode} += $vcpu_count;
> +            }
> +        }
> +
> +        # check if there are enough host cpus for the vcpus
> +        for my $hostnode (keys $used_host_nodes->%*) {
> +            my $vcpu_count = $used_host_nodes->{$hostnode};
> +            my $host_cpu_count = scalar(keys $host_cpus->{$hostnode}->%*);
> +
> +            die "Not enough CPUs available on NUMA node $hostnode\n"
> +                if $host_cpu_count < $vcpu_count;
> +        }
> +
> +        # try to fit remaining virtual numa nodes to real ones
> +        for my $numa_node (sort keys $numa_vcpu_map->%*) {
> +            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
> +            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
> +            if (!defined($hostnode)) {
> +                # try real nodes in ascending order until we find one that fits
> +                # NOTE: this is not guaranteed to be optimal; finding an optimal fit is
> +                # likely NP-hard, as it resembles the bin packing problem
> +                for my $node (sort keys $host_cpus->%*) {
> +                    next if $used_host_nodes->{$node};
> +                    my $host_cpu_count = scalar(keys $host_cpus->{$node}->%*);
> +                    my $vcpu_count = scalar(keys $numa_vcpu_map->{$numa_node}->{vcpus}->%*);
> +                    next if $host_cpu_count < $vcpu_count;
> +
> +                    $hostnode = $node;
> +                    $used_host_nodes->{$hostnode} //= 0;
> +                    $used_host_nodes->{$hostnode} += $vcpu_count;
> +                    last;
> +                }
> +
> +                die "Could not find a fitting host NUMA node for guest NUMA node $numa_node\n"
> +                    if !defined($hostnode);
> +                $numa_vcpu_map->{$numa_node}->{hostnode} = $hostnode;
> +            }
> +        }
> +
> +        # now every virtual numa node has a fitting host numa node and we can map from vcpu -> numa node
> +        for my $numa_node (keys $numa_vcpu_map->%*) {
> +            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
> +            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
> +
> +            for my $vcpu (keys $vcpus->%*) {
> +                $map->{$vcpu} = $hostnode;
> +            }
> +        }
> +    } else {
> +        my $vcpus = ($conf->{sockets} // 1) * ($conf->{cores} // 1);
> +        my $host_cpu_count = 0;
> +        for my $node (keys $host_cpus->%*) {
> +            $host_cpu_count += scalar(keys $host_cpus->{$node}->%*);
> +        }
> +        die "not enough available CPUs (limited by affinity) to pin\n" if $vcpus > $host_cpu_count;
> +        # returning empty list means the code can choose any cpu from the available ones
> +    }
> +    return $map;
> +}
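The first-fit search above (try host nodes in ascending order, skip already used ones, take the first with enough free CPUs) can be sketched as follows; the capacities and vCPU counts are made-up example values, not something read from a real host:

```shell
#!/bin/sh
# First-fit assignment of guest NUMA nodes to host NUMA nodes,
# mirroring the loop in get_vcpu_to_host_numa_map().
set -- 4 8 8          # free CPUs on host node 0, 1, 2 (example values)
guest_counts="6 8"    # vcpus of guest node 0 and 1 (example values)

used=""               # host nodes already taken
gnode=0
for want in $guest_counts; do
    hnode=0
    chosen=""
    for cap in "$@"; do
        # skip host nodes that already host another guest node
        case " $used " in *" $hnode "*) hnode=$((hnode + 1)); continue ;; esac
        if [ "$cap" -ge "$want" ]; then
            chosen=$hnode
            used="$used $hnode"
            break
        fi
        hnode=$((hnode + 1))
    done
    if [ -n "$chosen" ]; then
        echo "guest node $gnode -> host node $chosen"
    else
        echo "no host node fits guest node $gnode" >&2
        exit 1
    fi
    gnode=$((gnode + 1))
done
```

Here guest node 0 (6 vCPUs) skips host node 0 (only 4 CPUs) and lands on node 1; guest node 1 (8 vCPUs) then skips the used node 1 and lands on node 2.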
> +
> +=head2 choose_single_cpu
> +
> +Selects a CPU from the available C<$host_cpus> with the help of the given index
> +C<$vcpu> and the vCPU to NUMA node map C<$vcpu_map>.
> +
> +It modifies the C<$host_cpus> hash to reserve the chosen CPU.
> +
> +=cut
> +
> +sub choose_single_cpu($vcpu_map, $host_cpus, $vcpu) {
> +    my $hostnode = $vcpu_map->{$vcpu};
> +    if (!defined($hostnode)) {
> +        # choose a numa node at random
> +        $hostnode = (keys $host_cpus->%*)[0];
> +    }
> +
> +    # choose one at random
> +    my $real_cpu = (keys $host_cpus->{$hostnode}->%*)[0];
> +    delete $host_cpus->{$hostnode}->{$real_cpu};
> +    if (scalar($host_cpus->{$hostnode}->%*) == 0) {
> +        delete $host_cpus->{$hostnode};
> +    }
> +
> +    return $real_cpu;
> +}
> +
> +=head2 get_numa_cpulist
> +
> +Returns the list of usable CPUs for the given C<$vcpu> index with the help of
> +the vCPU to NUMA node map C<$vcpu_map> and the available C<$host_cpus> as a
> +string usable by the 'taskset' command.
> +
> +=cut
> +
> +sub get_numa_cpulist($vcpu_map, $host_cpus, $vcpu) {
> +    my $hostnode = $vcpu_map->{$vcpu};
> +    if (!defined($hostnode)) {
> +        # if there is no explicit mapping, simply don't pin at all
> +        return undef;
> +    }
> +
> +    return join(',', keys $host_cpus->{$hostnode}->%*);
> +}
> +
> +=head2 assert_pinning_constraints
> +
> +Verifies the pinning constraints by trying to construct the pinning
> +configuration. Useful to check the config before actually starting
> +the guest.
> +
> +=cut
> +
> +sub assert_pinning_constraints($conf) {
> +
> +    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
> +    if ($pinning ne $DEFAULT_VCPU_PINNING) {
> +        my $host_cpus = get_filtered_host_cpus($conf);
> +        get_vcpu_to_host_numa_map($conf, $host_cpus);
> +    }
> +}
> +
> +=head2 pin_threads_to_cpus
> +
> +Pins the vCPU threads of a running guest to host CPUs according to the
> +pinning, affinity and NUMA configuration.
> +
> +Needs the guest to be running, since it queries QMP for the vCPU thread list.
> +
> +=cut
> +
> +sub pin_threads_to_cpus($conf, $vmid) {
> +    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
> +    if ($pinning ne $DEFAULT_VCPU_PINNING) {
> +        my $host_cpus = get_filtered_host_cpus($conf);
> +        my $vcpu_map = get_vcpu_to_host_numa_map($conf, $host_cpus);
> +
> +        my $cpuinfo = PVE::QemuServer::Monitor::mon_cmd($vmid, 'query-cpus-fast');
> +        for my $vcpu ($cpuinfo->@*) {
> +
> +            my $vcpu_index = $vcpu->{'cpu-index'};
> +
> +            my $cpus;
> +            if ($pinning eq 'one-to-one') {
> +                $cpus = choose_single_cpu($vcpu_map, $host_cpus, $vcpu_index);
> +            } elsif ($pinning eq 'numa') {
> +                $cpus = get_numa_cpulist($vcpu_map, $host_cpus, $vcpu_index);
> +            }
> +
> +            die "no cpus selected for pinning vcpu $vcpu_index\n"
> +                if !defined($cpus);
> +
> +            my $tid = $vcpu->{'thread-id'};
> +            print "pinning vcpu $vcpu_index (thread $tid) to cpu(s) $cpus\n";
> +            run_command(
> +                ['taskset', '-c', '-p', $cpus, $vcpu->{'thread-id'}], logfunc => sub { },
> +            );
> +        }
> +    }
> +}
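For clarity, this is the taskset invocation the loop performs per vCPU thread; the sketch below only echoes the commands with placeholder thread ids and cpulists instead of executing them (the real values come from QMP's `query-cpus-fast`):

```shell
#!/bin/sh
# What the pinning loop runs per vCPU thread. TIDs and cpulists here
# are placeholders; the real ones come from 'query-cpus-fast'.
# Format of each entry: "<vcpu-index> <thread-id> <cpulist>"
for entry in "0 12345 0" "1 12346 1" "2 12347 2,3"; do
    set -- $entry
    echo "taskset -c -p $3 $2"
done
```

With 'one-to-one' pinning the cpulist is a single CPU; with 'numa' pinning it is the comma-separated CPU list of the assigned host node.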
> +
> +1;




