From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
Date: Tue, 25 Nov 2025 14:51:05 +0100
Message-ID: <20251125135107.561633-1-f.gruenbichler@proxmox.com>

after a certain amount of KSM sharing, PSS lookups become prohibitively
expensive. fall back to RSS (which was used before) in that case, to avoid
vmstatus calls blocking for long periods of time.

I benchmarked this with 3 VMs running with different levels of KSM sharing. in
the output below, "merged pages" refers to the contents of
/proc/$pid/ksm_merging_pages, extract_pss is the parsing code for the
cumulative PSS of a VM's cgroup run in isolation, extract_rss is the
corresponding code for cumulative RSS, qm_status_stock is `qm status $vmid
--verbose`, and qm_status_patched is `perl -I./src/PVE ./src/bin/qm status
$vmid --verbose` with this patch applied.
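
for reference, a rough sketch of what the extraction helpers could look like
(the actual benchmark scripts are not part of this mail, so the following is
an assumption; the paths and regexes mirror the patch below): extract_pss sums
the Pss field of /proc/$pid/smaps_rollup over all PIDs in the VM's cgroup, and
extract_rss would do the same with the VmRSS field of /proc/$pid/status. the
timing output below appears to be hyperfine's.

    #!/usr/bin/perl
    # rough sketch of an "extract_pss"-style benchmark helper (assumption,
    # not part of this series): sum the Pss of every process in a VM's
    # cgroup, using the same paths and regex as the patch below. an
    # "extract_rss" variant would read /proc/$pid/status and match VmRSS.
    use strict;
    use warnings;

    my $vmid = shift // die "usage: $0 <vmid>\n";

    my $total = 0;
    open(my $procs_fh, '<', "/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs")
        or die "failed to open cgroup.procs - $!\n";
    while (my $pid = <$procs_fh>) {
        chomp($pid);
        open(my $stat_fh, '<', "/proc/${pid}/smaps_rollup") or next;
        while (my $line = <$stat_fh>) {
            if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
                $total += int($1) * 1024; # kB -> bytes
                last; # only the first Pss line is needed per process
            }
        }
        close($stat_fh);
    }
    close($procs_fh);

    print "$total\n";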

first, a VM with barely any sharing:

merged pages: 1574

Benchmark 1: extract_pss
  Time (mean ± σ):      15.0 ms ±   0.6 ms    [User: 4.2 ms, System: 10.8 ms]
  Range (min … max):    14.1 ms …  17.0 ms    173 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 4.3 ms, System: 1.5 ms]
  Range (min … max):     5.3 ms …   7.7 ms    466 runs

Summary
  extract_rss ran
    2.56 ± 0.16 times faster than extract_pss

Benchmark 1: qm_status_stock
  Time (mean ± σ):     363.5 ms ±   5.6 ms    [User: 290.8 ms, System: 68.5 ms]
  Range (min … max):   353.1 ms … 370.4 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     360.6 ms ±   4.2 ms    [User: 285.4 ms, System: 71.0 ms]
  Range (min … max):   355.0 ms … 366.5 ms    10 runs

Summary
  qm_status_patched ran
    1.01 ± 0.02 times faster than qm_status_stock

this shows very little difference in total status runtime.

next, a VM with modest sharing:

merged pages: 52118

Benchmark 1: extract_pss
  Time (mean ± σ):      57.1 ms ±   1.3 ms    [User: 4.3 ms, System: 52.8 ms]
  Range (min … max):    54.6 ms …  60.4 ms    50 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.3 ms, System: 1.6 ms]
  Range (min … max):     5.4 ms …   6.9 ms    464 runs

Summary
  extract_rss ran
    9.60 ± 0.52 times faster than extract_pss

Benchmark 1: qm_status_stock
  Time (mean ± σ):     407.9 ms ±   5.9 ms    [User: 288.2 ms, System: 115.0 ms]
  Range (min … max):   402.2 ms … 419.3 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     412.9 ms ±   7.6 ms    [User: 294.4 ms, System: 113.9 ms]
  Range (min … max):   405.9 ms … 425.8 ms    10 runs

Summary
  qm_status_stock ran
    1.01 ± 0.02 times faster than qm_status_patched

while the stat extraction alone would be a lot faster via RSS, it is still
dwarfed by the total status runtime, so the end-to-end difference is small
(the patched `qm status` still uses PSS in this case, as the sharing is below
the threshold!).

and now a VM with the problematic behaviour caused by lots of sharing (~12GB):

merged pages: 3095741
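(3095741 merged pages × 4 KiB per page ≈ 11.8 GiB, matching the ~12GB above)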

Benchmark 1: extract_pss
  Time (mean ± σ):     583.2 ms ±   4.6 ms    [User: 3.9 ms, System: 579.1 ms]
  Range (min … max):   573.9 ms … 591.7 ms    10 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.2 ms, System: 1.7 ms]
  Range (min … max):     5.4 ms …   7.3 ms    412 runs

Summary
  extract_rss ran
   97.66 ± 5.00 times faster than extract_pss

extraction via PSS alone is now slower than the whole status call with RSS:

Benchmark 1: qm_status_stock
  Time (mean ± σ):     935.5 ms ±   8.4 ms    [User: 292.2 ms, System: 638.6 ms]
  Range (min … max):   924.8 ms … 952.0 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     359.9 ms ±   7.6 ms    [User: 295.1 ms, System: 60.3 ms]
  Range (min … max):   350.1 ms … 371.3 ms    10 runs

Summary
  qm_status_patched ran
    2.60 ± 0.06 times faster than qm_status_stock

Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---

Notes:
    the threshold is a bit arbitrary; we could also consider setting it
    lower to be on the safe side, or making it relative to the total
    number of pages of memory (a rough sketch of the latter option
    follows below)..
    
    one issue with this approach is that if KSM is disabled later on and
    all the merging is undone, the problematic behaviour remains, and
    there is - AFAICT - no trace of this state in the process'
    `ksm_stat` or elsewhere. the behaviour goes away once the VM is
    stopped and started again. instead of making a per-pid decision, we
    might want to opt for a global RSS fallback whenever KSM is detected
    as active on the host?

    we should of course also investigate further whether this is fixable
    or improvable on the kernel side..
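
to illustrate the relative-threshold idea from the notes, a minimal sketch of
how it could look next to the existing helper in QemuServer.pm (hypothetical,
not part of this patch; the helper name and the 1/16 fraction are made up for
illustration):

    # hypothetical: fall back to RSS once more than e.g. 1/16 of the host's
    # total memory is merged via KSM for a process. MemTotal is the first
    # line of /proc/meminfo and is given in kB; ksm_merging_pages counts
    # 4 kB pages.
    my sub ksm_rss_fallback_threshold_pages {
        my $firstline = PVE::Tools::file_read_firstline('/proc/meminfo');
        my ($memtotal_kb) = ($firstline // '') =~ m/^MemTotal:\s+([0-9]+) kB$/;
        die "unable to parse MemTotal from /proc/meminfo\n" if !defined($memtotal_kb);
        return int($memtotal_kb / 4 / 16); # total 4 kB page count, divided by 16
    }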

 src/PVE/QemuServer.pm | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index a7fbec14..82e9c004 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
     if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
         while (my $pid = <$procs_fh>) {
             chomp($pid);
+            my $filename = 'smaps_rollup';
+            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
 
-            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
+            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
+            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
+            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
+                $filename = 'status';
+                $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
+            }
+
+            open(my $pid_fh, '<', "/proc/${pid}/${filename}")
                 or $!{ENOENT}
-                or die "failed to open PSS memory-stat from process - $!\n";
-            next if !defined($smaps_fh);
+                or die "failed to open /proc/${pid}/${filename} - $!\n";
+            next if !defined($pid_fh);
 
-            while (my $line = <$smaps_fh>) {
-                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
+            while (my $line = <$pid_fh>) {
+                if ($line =~ $extract_usage_re) {
                     $memory_usage += int($1) * 1024;
                     last; # end inner while loop, go to next $pid
                 }
             }
-            close $smaps_fh;
+            close $pid_fh;
         }
         close($procs_fh);
     }
-- 
2.47.3



