From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH qemu-server] fix #6935: vmstatus: use CGroup for host memory usage
Date: Fri, 28 Nov 2025 11:36:34 +0100
Message-ID: <20251128103639.446372-1-f.gruenbichler@proxmox.com>
after a certain amount of KSM sharing, PSS lookups become prohibitively
expensive. instead of reverting to the old broken method, simply use the
cgroup's memory usage as the `memhost` value.
this no longer explicitly accounts for KSM-merged pages.
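as a toy illustration of why the two metrics diverge at all (not PVE code,
numbers made up): RSS counts every mapped page in full, while PSS charges each
shared page 1/N to each of the N processes mapping it, so merging pages via KSM
lowers PSS but leaves RSS untouched:

```python
# Toy model: RSS vs. PSS for a process with private pages plus pages
# that KSM has merged and which are now shared between `sharers` VMs.
PAGE = 4096  # assumed 4 KiB pages


def rss(private_pages, shared_pages):
    # RSS: every mapped page counts in full
    return (private_pages + shared_pages) * PAGE


def pss(private_pages, shared_pages, sharers):
    # PSS: shared pages are charged proportionally, 1/sharers each
    return private_pages * PAGE + shared_pages * PAGE // sharers


# e.g. 1000 private pages plus 500 pages merged across 4 VMs:
print(rss(1000, 500))     # 6144000
print(pss(1000, 500, 4))  # 4608000
```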
I benchmarked this with 4 VMs running with different levels of KSM sharing. in
the output below, "merged pages" refers to the contents of
/proc/$pid/ksm_merging_pages, the extract_* benchmark runs refer to four
different variants of extracting memory usage, with the actual extraction part
running 1000x in a loop for each run to amortize perl/process setup costs,
qm_status_stock is `qm status $vmid --verbose`, and qm_status_patched is `perl
-I./src/PVE ./src/bin/qm status $vmid --verbose` with this patch applied.
the variants:
- extract_pss: status before this patch, query smaps_rollup for each process
that is part of the qemu.slice of the VM
- extract_rss: extract VmRSS from the `/proc/$pid/status` file of the main
process
- extract_rss_cgroup: like _rss, but for each process of the slice
- extract_cgroup: use PVE::QemuServer::CGroup get_memory_stat (this patch)
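roughly, the variants boil down to the following reads (sketched here in
Python against a throwaway fake /proc-and-cgroup tree so it runs anywhere; the
cgroup v2 paths mirror the real layout but are assumptions in this sketch):

```python
# Fake tree standing in for /proc and /sys/fs/cgroup, with one PID in the scope.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
scope = root / "cgroup/qemu.slice/113.scope"
proc = root / "proc/1234"
scope.mkdir(parents=True)
proc.mkdir(parents=True)
(proc / "smaps_rollup").write_text("Rss:     1024 kB\nPss:      512 kB\n")
(proc / "status").write_text("VmRSS:\t  1024 kB\n")
(scope / "cgroup.procs").write_text("1234\n")
(scope / "memory.current").write_text("524288\n")


def kb_field(path, key):
    # first "<key> <n> kB" line of a /proc status-style file, in bytes
    for line in path.read_text().splitlines():
        if line.startswith(key):
            return int(line.split()[1]) * 1024
    return 0


pids = (scope / "cgroup.procs").read_text().split()

# extract_pss: one smaps_rollup read per PID in the scope, summing Pss
pss = sum(kb_field(root / "proc" / pid / "smaps_rollup", "Pss:") for pid in pids)
# extract_rss: VmRSS of the main process only
rss = kb_field(proc / "status", "VmRSS:")
# extract_cgroup: a single read of the scope's memory.current
cgroup = int((scope / "memory.current").read_text())

print(pss, rss, cgroup)  # 524288 1048576 524288
```

the key difference for runtime is that extract_pss forces the kernel to walk
the page tables of every process in the scope, while extract_cgroup is a
single counter read.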
first, with no KSM active
VMID: 113
pss: 724971520
rss: 733282304
cgroup: 727617536
rss_cgroup: 733282304
Benchmark 1: extract_pss 271.2 ms ± 6.0 ms [User: 226.3 ms, System: 44.7 ms]
Benchmark 2: extract_rss 267.8 ms ± 3.6 ms [User: 223.9 ms, System: 43.7 ms]
Benchmark 3: extract_cgroup 273.5 ms ± 6.2 ms [User: 227.2 ms, System: 46.2 ms]
Benchmark 4: extract_rss_cgroup 270.5 ms ± 3.7 ms [User: 225.0 ms, System: 45.3 ms]
both reported usage and runtime are in the same ballpark
VMID: 838383 (with 48G of memory):
pss: 40561564672
rss: 40566108160
cgroup: 40961339392
rss_cgroup: 40572141568
usage is in the same ballpark
Benchmark 1: extract_pss 732.0 ms ± 4.4 ms [User: 224.8 ms, System: 506.8 ms]
Benchmark 2: extract_rss 272.1 ms ± 5.2 ms [User: 227.8 ms, System: 44.0 ms]
Benchmark 3: extract_cgroup 274.2 ms ± 2.2 ms [User: 227.8 ms, System: 46.2 ms]
Benchmark 4: extract_rss_cgroup 270.9 ms ± 3.9 ms [User: 224.9 ms, System: 45.8 ms]
but PSS is already a lot slower.
Benchmark 1: qm_status_stock 820.9 ms ± 7.5 ms [User: 293.1 ms, System: 523.3 ms]
Benchmark 2: qm_status_patched 356.2 ms ± 5.6 ms [User: 290.2 ms, System: 61.5 ms]
which is also visible in the stock vs. patched `qm status` runs above.
the other two VMs behaved like 113.
and now with KSM active
VMID: 113
merged pages: 10747 (very little)
pss: 559815680
rss: 594853888
cgroup: 568197120
rss_cgroup: 594853888
Benchmark 1: extract_pss 280.0 ms ± 2.4 ms [User: 229.5 ms, System: 50.2 ms]
Benchmark 2: extract_rss 274.8 ms ± 3.7 ms [User: 225.9 ms, System: 48.7 ms]
Benchmark 3: extract_cgroup 279.0 ms ± 4.6 ms [User: 228.0 ms, System: 50.7 ms]
Benchmark 4: extract_rss_cgroup 274.7 ms ± 6.7 ms [User: 228.0 ms, System: 46.4 ms]
still in the same ballpark
VMID: 838383 (with 48G of memory)
merged pages: 6696434 (a lot - this is 25G worth of pages!)
pss: 12411169792
rss: 38772117504
cgroup: 12799062016
rss_cgroup: 38778150912
the RSS-based numbers are roughly the same as before, but the cgroup value is
almost identical to PSS despite KSM being active!
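the "25G worth of pages" figure above is a back-of-the-envelope conversion
assuming 4 KiB pages, and it also matches the gap between rss and pss:

```python
# merged ksm_merging_pages for VM 838383, times the assumed 4 KiB page size
merged_pages = 6696434
merged_bytes = merged_pages * 4096
print(round(merged_bytes / 2**30, 1))  # 25.5 (GiB)

# roughly the rss - pss gap reported for that VM
print(round((38772117504 - 12411169792) / 2**30, 1))  # 24.6 (GiB)
```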
Benchmark 1: extract_pss 691.7 ms ± 3.4 ms [User: 225.5 ms, System: 465.8 ms]
Benchmark 2: extract_rss 276.3 ms ± 7.1 ms [User: 227.4 ms, System: 48.6 ms]
Benchmark 3: extract_cgroup 277.8 ms ± 4.4 ms [User: 228.5 ms, System: 49.1 ms]
Benchmark 4: extract_rss_cgroup 274.7 ms ± 3.5 ms [User: 226.6 ms, System: 47.8 ms]
but it is still fast!
Benchmark 1: qm_status_stock 771.8 ms ± 7.2 ms [User: 296.0 ms, System: 471.0 ms]
Benchmark 2: qm_status_patched 360.2 ms ± 5.1 ms [User: 287.1 ms, System: 68.5 ms]
confirmed by `qm status` as well
VMID: 838384
merged pages: 165540 (little, this is about 645MB worth of pages)
pss: 2522527744
rss: 2927058944
cgroup: 2500329472
rss_cgroup: 2932944896
Benchmark 1: extract_pss 318.4 ms ± 3.6 ms [User: 227.3 ms, System: 90.8 ms]
Benchmark 2: extract_rss 273.9 ms ± 5.8 ms [User: 226.5 ms, System: 47.2 ms]
Benchmark 3: extract_cgroup 276.3 ms ± 4.1 ms [User: 225.4 ms, System: 50.7 ms]
Benchmark 4: extract_rss_cgroup 276.5 ms ± 8.6 ms [User: 226.1 ms, System: 50.1 ms]
Benchmark 1: qm_status_stock 400.2 ms ± 6.6 ms [User: 292.1 ms, System: 103.5 ms]
Benchmark 2: qm_status_patched 357.0 ms ± 4.1 ms [User: 288.7 ms, System: 63.7 ms]
results match those of 838383, just with less effect
the fourth VM matches this as well.
Fixes/Reverts: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
given the numbers, going with the CGroup-based approach seems best - it gives
us accurate numbers without the slowdown, and gives users an insight into how
KSM affects their guests' host memory usage without flip-flopping.
src/PVE/QemuServer.pm | 35 ++++-------------------------------
1 file changed, 4 insertions(+), 31 deletions(-)
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index a7fbec14..62d835a5 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2324,35 +2324,6 @@ sub vzlist {
return $vzlist;
}
-# Iterate over all PIDs inside a VMID's cgroup slice and accumulate their PSS (proportional set
-# size) to get a relatively telling effective memory usage of all processes involved with a VM.
-my sub get_vmid_total_cgroup_memory_usage {
- my ($vmid) = @_;
-
- my $memory_usage = 0;
- if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
- while (my $pid = <$procs_fh>) {
- chomp($pid);
-
- open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
- or $!{ENOENT}
- or die "failed to open PSS memory-stat from process - $!\n";
- next if !defined($smaps_fh);
-
- while (my $line = <$smaps_fh>) {
- if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
- $memory_usage += int($1) * 1024;
- last; # end inner while loop, go to next $pid
- }
- }
- close $smaps_fh;
- }
- close($procs_fh);
- }
-
- return $memory_usage;
-}
-
our $vmstatus_return_properties = {
vmid => get_standard_option('pve-vmid'),
status => {
@@ -2614,9 +2585,11 @@ sub vmstatus {
$d->{uptime} = int(($uptime - $pstat->{starttime}) / $cpuinfo->{user_hz});
- $d->{memhost} = get_vmid_total_cgroup_memory_usage($vmid);
+ my $cgroup = PVE::QemuServer::CGroup->new($vmid);
+ my $cgroup_mem = $cgroup->get_memory_stat();
+ $d->{memhost} = $cgroup_mem->{mem} // 0;
- $d->{mem} = $d->{memhost}; # default to cgroup PSS sum, balloon info can override this below
+ $d->{mem} = $d->{memhost}; # default to cgroup, balloon info can override this below
my $pressures = PVE::ProcFSTools::read_cgroup_pressure("qemu.slice/${vmid}.scope");
$d->{pressurecpusome} = $pressures->{cpu}->{some}->{avg10} * 1;
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel