From: Dominik Csapak <d.csapak@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [RFC PATCH qemu-server] fix #7282: allow (NUMA aware) vCPU pinning
Date: Tue, 17 Feb 2026 12:01:24 +0100
Message-ID: <20260217114813.2063770-1-d.csapak@proxmox.com>

Introduce a new 'pinning' property, which (for now) has two methods for vCPU
pinning:

* numa: pins the vCPUs of each virtual NUMA node to a corresponding host
  NUMA node
* one-to-one: pins each vCPU to a corresponding host CPU, while keeping
  virtual NUMA nodes together on host NUMA nodes

Both options respect the 'affinity' setting and the various 'numaX' settings
as well as possible. Some configurations are not supported, but these don't
make sense from a performance standpoint (where pinning would be necessary)
anyway. The unsupported configurations are: assigning more than one host NUMA
node to a single virtual NUMA node, and defining more guest NUMA nodes than
are available on the host without explicitly mapping them to host NUMA nodes.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
Sending as RFC for now, because I'm still working on implementing some tests
and doing some methodical testing.

Preliminary tests with 'pcm-numa' show that pinning (both 'numa' and
'one-to-one', names pending) greatly reduces cross NUMA node RAM accesses.
I tested with 'cpubench1a': a VM with 1 socket and 8 cores and a default
config (no NUMA, no affinity, no pinning) gets ~50M local and ~50M remote
DRAM accesses per second, but with pinning I see ~130M local DRAM accesses
and <1M remote ones.
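For reference, a minimal config that exercises the new property could look
roughly like the following sketch (memory size and host node choice are
illustrative assumptions, not necessarily the exact benchmark setup):

    sockets: 1
    cores: 8
    memory: 8192
    numa: 1
    numa0: cpus=0-7,memory=8192,hostnodes=0,policy=bind
    pinning: numa

With 'one-to-one' instead of 'numa', each vCPU thread additionally gets
pinned to a single host CPU inside the selected host node. The applied
pinning can be double-checked on the host with 'taskset -cp <thread-id>'
for the thread IDs printed by pin_threads_to_cpus.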
The weird thing is that the more specifically I pin, the lower the cpubench1a
score gets:

  no settings:                                     ~1600
  with numa pinning:                               ~1550
  with numa pinning without hyperthreads:          ~1450
  pinned to specific cores (without hyperthreads): ~1450

So I'm still investigating whether it's the benchmark or whether my
assumptions about performance here are wrong. I'll test another benchmark
too, and in my next version I'll include the results in the commit message.

 src/PVE/QemuServer.pm         |  12 ++
 src/PVE/QemuServer/Makefile   |   1 +
 src/PVE/QemuServer/Pinning.pm | 388 ++++++++++++++++++++++++++++++++++
 3 files changed, 401 insertions(+)
 create mode 100644 src/PVE/QemuServer/Pinning.pm

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 5d2dbe03..f055f33f 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -85,6 +85,7 @@ use PVE::QemuServer::Memory qw(get_current_memory);
 use PVE::QemuServer::MetaInfo;
 use PVE::QemuServer::Monitor qw(mon_cmd);
 use PVE::QemuServer::Network;
+use PVE::QemuServer::Pinning;
 use PVE::QemuServer::OVMF;
 use PVE::QemuServer::PCI qw(print_pci_addr print_pcie_addr print_pcie_root_port parse_hostpci);
 use PVE::QemuServer::QemuImage;
@@ -735,6 +736,12 @@ EODESCR
         optional => 1,
         default => 1,
     },
+    pinning => {
+        type => 'string',
+        format => $PVE::QemuServer::Pinning::pinning_fmt,
+        description => "Set pinning options for the guest",
+        optional => 1,
+    },
 };
 
 my $cicustom_fmt = {
@@ -3174,6 +3181,8 @@ sub config_to_command {
         push @$cmd, '/usr/bin/taskset', '--cpu-list', '--all-tasks', $conf->{affinity};
     }
 
+    PVE::QemuServer::Pinning::assert_pinning_constraints($conf);
+
     push @$cmd, $kvm_binary;
 
     push @$cmd, '-id', $vmid;
@@ -5816,6 +5825,9 @@ sub vm_start_nolock {
 
     syslog("info", "VM $vmid started with PID $pid.");
 
+    eval { PVE::QemuServer::Pinning::pin_threads_to_cpus($conf, $vmid) };
+    log_warn("could not pin vCPU threads - $@") if $@;
+
     if (defined(my $migrate = $res->{migrate})) {
         if ($migrate->{proto} eq 'tcp') {
             my $nodename = nodename();
diff --git a/src/PVE/QemuServer/Makefile b/src/PVE/QemuServer/Makefile
index d599ca91..9edcbf6b 100644
--- a/src/PVE/QemuServer/Makefile
+++ b/src/PVE/QemuServer/Makefile
@@ -21,6 +21,7 @@ SOURCES=Agent.pm \
 	Network.pm \
 	OVMF.pm \
 	PCI.pm \
+	Pinning.pm \
 	QemuImage.pm \
 	QMPHelpers.pm \
 	QSD.pm \
diff --git a/src/PVE/QemuServer/Pinning.pm b/src/PVE/QemuServer/Pinning.pm
new file mode 100644
index 00000000..c3bc870f
--- /dev/null
+++ b/src/PVE/QemuServer/Pinning.pm
@@ -0,0 +1,388 @@
+package PVE::QemuServer::Pinning;
+
+use v5.36;
+
+use PVE::CpuSet;
+use PVE::QemuServer::Memory;
+use PVE::QemuServer::Monitor;
+
+use PVE::Tools qw(dir_glob_foreach run_command);
+
+=head1 NAME
+
+C<PVE::QemuServer::Pinning> - Functions for pinning vCPUs of QEMU guests
+
+=head1 DESCRIPTION
+
+This module contains functions that help with pinning QEMU guest vCPUs to
+host CPUs, considering other parts of the config like NUMA nodes and affinity.
+
+Before the guest is started, C<assert_pinning_constraints> should be called to
+check the pinning constraints.
+
+After the guest is started, C<pin_threads_to_cpus> is called to actually pin
+the vCPU threads to the host CPUs according to the config.
+
+=cut
+
+my $DEFAULT_VCPU_PINNING = "none";
+
+our $pinning_fmt = {
+    'vcpus' => {
+        type => 'string',
+        enum => ['one-to-one', 'numa', $DEFAULT_VCPU_PINNING],
+        description => "Set the type of vCPU pinning.",
+        verbose_description => <<EODESCR,
+Set the vCPU pinning mode. With 'numa', the vCPUs of each virtual NUMA node
+are pinned to a corresponding host NUMA node. With 'one-to-one', each vCPU is
+additionally pinned to a single host CPU.
+EODESCR
+        default => 'none',
+        default_key => 1,
+    },
+};
+
+=head2 Restrictions and Constraints
+
+To keep it simple, pinning is limited to a subset of possible NUMA
+configurations.
+For example, with 'numaX' configs it's possible to create overlapping virtual
+NUMA nodes, e.g. assigning vCPUs 0-3 to host NUMA nodes 0-1 and vCPUs 4-7 to
+host NUMA nodes 1-2. To keep the pinning logic simple, such configurations are
+not supported. Instead, the user should assign each virtual NUMA node to only
+a single host NUMA node.
+
+=cut
+
+=head2 host_numa_node_cpu_list
+
+Returns the total host CPU count and a map of host NUMA nodes to CPUs.
+
+=cut
+
+my sub host_numa_node_cpu_list {
+    my $base_path = "/sys/devices/system/node/";
+
+    my $map = {};
+    my $count = 0;
+
+    dir_glob_foreach(
+        $base_path,
+        'node(\d+)',
+        sub {
+            my ($fullnode, $nodeid) = @_;
+            opendir(my $dirfd, "$base_path/$fullnode")
+                || die "cannot open numa node dir $fullnode\n";
+            # this is a hash so we can use the key randomness for selecting
+            my $cpus = {};
+            for my $cpu (readdir($dirfd)) {
+                if ($cpu =~ m/^cpu(\d+)$/) {
+                    $cpus->{$1} = 1;
+                    $count++;
+                }
+            }
+            closedir($dirfd);
+
+            $map->{$nodeid} = $cpus;
+        },
+    );
+
+    return ($count, $map);
+}
+
+=head2 limit_by_affinity
+
+Limits the given map of host NUMA nodes to CPUs to the CPUs contained in the
+given affinity map.
+
+=cut
+
+my sub limit_by_affinity($host_cpus, $affinity_members) {
+    return $host_cpus if !$affinity_members || scalar($affinity_members->%*) == 0;
+
+    for my $node_id (keys $host_cpus->%*) {
+        my $node = $host_cpus->{$node_id};
+        for my $cpu_id (keys $node->%*) {
+            delete $node->{$cpu_id} if !defined($affinity_members->{$cpu_id});
+        }
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vnuma_vcpu_map
+
+Returns a hash from virtual NUMA nodes to their vCPUs and an optional host
+NUMA node.
+
+=cut
+
+my sub get_vnuma_vcpu_map($conf) {
+    my $map = {};
+    my $sockets = $conf->{sockets} // 1;
+    my $cores = $conf->{cores} // 1;
+    my $vcpu_count = 0;
+
+    if ($conf->{numa}) {
+        for (my $i = 0; $i < $PVE::QemuServer::Memory::MAX_NUMA; $i++) {
+            my $entry = $conf->{"numa$i"} or next;
+            my $numa = PVE::QemuServer::Memory::parse_numa($entry) or next;
+
+            $map->{$i} = { vcpus => {} };
+            for my $cpurange ($numa->{cpus}->@*) {
+                my ($start, $end) = $cpurange->@*;
+                for my $cpu (($start .. ($end // $start))) {
+                    $map->{$i}->{vcpus}->{$cpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+
+            if (my $hostnodes = $numa->{hostnodes}) {
+                die "Pinning only available for 1-to-1 NUMA node mapping\n"
+                    if (scalar($hostnodes->@*) > 1 || defined($hostnodes->[0]->[1]));
+                $map->{$i}->{hostnode} = $hostnodes->[0]->[0];
+            }
+        }
+
+        my $vcpu_maps = scalar(keys $map->%*);
+        if ($vcpu_maps == 0) {
+            for my $socket ((0 .. ($sockets - 1))) {
+                $map->{$socket} = { vcpus => {} };
+                for my $cpu ((0 .. ($cores - 1))) {
+                    my $vcpu = $socket * $cores + $cpu;
+                    $map->{$socket}->{vcpus}->{$vcpu} = 1;
+                    $vcpu_count++;
+                }
+            }
+        }
+
+        die "Invalid NUMA configuration for pinning, some vCPUs missing in numa binding\n"
+            if $vcpu_count != $sockets * $cores;
+    } else {
+        # numa not enabled, so all vCPUs are on the same node
+        $map->{0} = { vcpus => {} };
+        for my $i ((0 .. ($sockets * $cores - 1))) {
+            $map->{0}->{vcpus}->{$i} = 1;
+        }
+    }
+
+    return $map;
+}
+
+=head2 get_filtered_host_cpus
+
+Returns the map of host NUMA nodes to CPUs after applying the optional
+affinity from the config.
+
+=cut
+
+sub get_filtered_host_cpus($conf) {
+    my ($host_cpu_count, $host_cpus) = host_numa_node_cpu_list();
+
+    if (my $affinity = $conf->{affinity}) {
+        my ($_affinity_count, $affinity_members) = PVE::CpuSet::parse_cpuset($affinity);
+        $host_cpus = limit_by_affinity($host_cpus, $affinity_members);
+    }
+
+    return $host_cpus;
+}
+
+=head2 get_vcpu_to_host_numa_map
+
+Returns a map from vCPUs to host NUMA nodes and also checks the currently
+supported size constraints.
+
+=cut
+
+sub get_vcpu_to_host_numa_map($conf, $host_cpus) {
+    my $numa_vcpu_map = get_vnuma_vcpu_map($conf);
+
+    my $map = {};
+
+    if ($conf->{numa}) {
+        my $used_host_nodes = {};
+
+        # fill the map of requested host numa nodes with their vcpu count
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (defined($hostnode)) {
+                my $vcpu_count = scalar(keys $vcpus->%*);
+
+                $used_host_nodes->{$hostnode} //= 0;
+                $used_host_nodes->{$hostnode} += $vcpu_count;
+            }
+        }
+
+        # check if there are enough host cpus for the vcpus
+        for my $hostnode (keys $used_host_nodes->%*) {
+            my $vcpu_count = $used_host_nodes->{$hostnode};
+            my $host_cpu_count = scalar(keys $host_cpus->{$hostnode}->%*);
+
+            die "Not enough CPUs available on NUMA node $hostnode\n"
+                if $host_cpu_count < $vcpu_count;
+        }
+
+        # try to fit the remaining virtual numa nodes onto real ones
+        for my $numa_node (sort keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+            if (!defined($hostnode)) {
+                # try real nodes in ascending order until we find one that fits
+                # NOTE: this is not the optimal solution, as that would probably be NP hard
+                # (it's similar to the bin packing problem)
+                for my $node (sort keys $host_cpus->%*) {
+                    next if $used_host_nodes->{$node};
+                    my $host_cpu_count = scalar(keys $host_cpus->{$node}->%*);
+                    my $vcpu_count = scalar(keys $numa_vcpu_map->{$numa_node}->{vcpus}->%*);
+                    next if $host_cpu_count < $vcpu_count;
+
+                    $hostnode = $node;
+                    $used_host_nodes->{$hostnode} //= 0;
+                    $used_host_nodes->{$hostnode} += $vcpu_count;
+                    last;
+                }
+
+                die "Could not find a fitting host NUMA node for guest NUMA node $numa_node\n"
+                    if !defined($hostnode);
+                $numa_vcpu_map->{$numa_node}->{hostnode} = $hostnode;
+            }
+        }
+
+        # now every virtual numa node has a fitting host numa node and we can
+        # map from vcpu -> host numa node
+        for my $numa_node (keys $numa_vcpu_map->%*) {
+            my $vcpus = $numa_vcpu_map->{$numa_node}->{vcpus};
+            my $hostnode = $numa_vcpu_map->{$numa_node}->{hostnode};
+
+            for my $vcpu (keys $vcpus->%*) {
+                $map->{$vcpu} = $hostnode;
+            }
+        }
+    } else {
+        my $vcpus = ($conf->{sockets} // 1) * ($conf->{cores} // 1);
+        my $host_cpu_count = 0;
+        for my $node (keys $host_cpus->%*) {
+            $host_cpu_count += scalar(keys $host_cpus->{$node}->%*);
+        }
+        die "not enough available CPUs (limited by affinity) to pin\n" if $vcpus > $host_cpu_count;
+        # returning an empty map means the code can choose any cpu from the available ones
+    }
+    return $map;
+}
+
+=head2 choose_single_cpu
+
+Selects a CPU from the available C<$host_cpus> with the help of the given index
+C<$vcpu> and the vCPU to NUMA node map C<$vcpu_map>.
+
+It modifies the C<$host_cpus> hash to reserve the chosen CPU.
+
+=cut
+
+sub choose_single_cpu($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # choose a numa node at random
+        $hostnode = (keys $host_cpus->%*)[0];
+    }
+
+    # choose a cpu at random
+    my $real_cpu = (keys $host_cpus->{$hostnode}->%*)[0];
+    delete $host_cpus->{$hostnode}->{$real_cpu};
+    if (scalar($host_cpus->{$hostnode}->%*) == 0) {
+        delete $host_cpus->{$hostnode};
+    }
+
+    return $real_cpu;
+}
+
+=head2 get_numa_cpulist
+
+Returns the list of usable CPUs for the given C<$vcpu> index with the help of
+the vCPU to NUMA node map C<$vcpu_map> and the available C<$host_cpus>, as a
+string usable by the 'taskset' command.
+
+=cut
+
+sub get_numa_cpulist($vcpu_map, $host_cpus, $vcpu) {
+    my $hostnode = $vcpu_map->{$vcpu};
+    if (!defined($hostnode)) {
+        # if there is no explicit mapping, simply don't pin at all
+        return undef;
+    }
+
+    return join(',', keys $host_cpus->{$hostnode}->%*);
+}
+
+=head2 assert_pinning_constraints
+
+Verifies the pinning constraints by trying to construct the pinning
+configuration. Useful to check the config before actually starting the guest.
+
+=cut
+
+sub assert_pinning_constraints($conf) {
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        get_vcpu_to_host_numa_map($conf, $host_cpus);
+    }
+}
+
+=head2 pin_threads_to_cpus
+
+Pins the vCPU threads of a running guest to host CPUs according to the
+pinning, affinity and NUMA configuration.
+
+Needs the guest to be running, since it queries QMP for the vCPU thread list.
+
+=cut
+
+sub pin_threads_to_cpus($conf, $vmid) {
+    my $pinning = $conf->{pinning} // $DEFAULT_VCPU_PINNING;
+    if ($pinning ne $DEFAULT_VCPU_PINNING) {
+        my $host_cpus = get_filtered_host_cpus($conf);
+        my $vcpu_map = get_vcpu_to_host_numa_map($conf, $host_cpus);
+
+        my $cpuinfo = PVE::QemuServer::Monitor::mon_cmd($vmid, 'query-cpus-fast');
+        for my $vcpu ($cpuinfo->@*) {
+            my $vcpu_index = $vcpu->{'cpu-index'};
+
+            my $cpus;
+            if ($pinning eq 'one-to-one') {
+                $cpus = choose_single_cpu($vcpu_map, $host_cpus, $vcpu_index);
+            } elsif ($pinning eq 'numa') {
+                $cpus = get_numa_cpulist($vcpu_map, $host_cpus, $vcpu_index);
+            }
+
+            die "no cpus selected for pinning vcpu $vcpu_index\n"
+                if !defined($cpus);
+
+            my $tid = $vcpu->{'thread-id'};
+            print "pinning vcpu $vcpu_index (thread $tid) to cpu(s) $cpus\n";
+            run_command(
+                ['taskset', '-c', '-p', $cpus, $tid],
+                logfunc => sub { },
+            );
+        }
+    }
+}
+
+1;
-- 
2.47.3