From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH qemu-server] fix #6935: vmstatus: use CGroup for host memory usage
Date: Fri, 28 Nov 2025 11:36:34 +0100
Message-ID: <20251128103639.446372-1-f.gruenbichler@proxmox.com>
after a certain amount of KSM sharing, PSS lookups become prohibitively
expensive. instead of reverting to the old broken method, simply use the
cgroup's memory usage as the `memhost` value.
this no longer explicitly accounts for KSM-merged pages.
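as a toy illustration of why the two metrics diverge at all (not PVE code,
numbers made up): RSS counts every mapped page in full, while PSS charges each
shared page 1/N to each of the N processes mapping it, so merging pages via KSM
lowers PSS but leaves RSS untouched:

```python
# Toy model: RSS vs. PSS for a process with private pages plus pages
# that KSM has merged and which are now shared between `sharers` VMs.
PAGE = 4096  # assumed 4 KiB pages


def rss(private_pages, shared_pages):
    # RSS: every mapped page counts in full
    return (private_pages + shared_pages) * PAGE


def pss(private_pages, shared_pages, sharers):
    # PSS: shared pages are charged proportionally, 1/sharers each
    return private_pages * PAGE + shared_pages * PAGE // sharers


# e.g. 1000 private pages plus 500 pages merged across 4 VMs:
print(rss(1000, 500))     # 6144000
print(pss(1000, 500, 4))  # 4608000
```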
I benchmarked this with 4 VMs running with different levels of KSM sharing. in
the output below, "merged pages" refers to the contents of
/proc/$pid/ksm_merging_pages, the extract_* benchmark runs refer to four
different variants of extracting memory usage, with the actual extraction part
running 1000x in a loop for each run to amortize perl/process setup costs,
qm_status_stock is `qm status $vmid --verbose`, and qm_status_patched is `perl
-I./src/PVE ./src/bin/qm status $vmid --verbose` with this patch applied.
the variants:
- extract_pss: status before this patch, query smaps_rollup for each process
that is part of the qemu.slice of the VM
- extract_rss: extract VmRSS from the `/proc/$pid/status` file of the main
process
- extract_rss_cgroup: like _rss, but for each process of the slice
- extract_cgroup: use PVE::QemuServer::CGroup get_memory_stat (this patch)
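roughly, the variants boil down to the following reads (sketched here in
Python against a throwaway fake /proc-and-cgroup tree so it runs anywhere; the
cgroup v2 paths mirror the real layout but are assumptions in this sketch):

```python
# Fake tree standing in for /proc and /sys/fs/cgroup, with one PID in the scope.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
scope = root / "cgroup/qemu.slice/113.scope"
proc = root / "proc/1234"
scope.mkdir(parents=True)
proc.mkdir(parents=True)
(proc / "smaps_rollup").write_text("Rss:     1024 kB\nPss:      512 kB\n")
(proc / "status").write_text("VmRSS:\t  1024 kB\n")
(scope / "cgroup.procs").write_text("1234\n")
(scope / "memory.current").write_text("524288\n")


def kb_field(path, key):
    # first "<key> <n> kB" line of a /proc status-style file, in bytes
    for line in path.read_text().splitlines():
        if line.startswith(key):
            return int(line.split()[1]) * 1024
    return 0


pids = (scope / "cgroup.procs").read_text().split()

# extract_pss: one smaps_rollup read per PID in the scope, summing Pss
pss = sum(kb_field(root / "proc" / pid / "smaps_rollup", "Pss:") for pid in pids)
# extract_rss: VmRSS of the main process only
rss = kb_field(proc / "status", "VmRSS:")
# extract_cgroup: a single read of the scope's memory.current
cgroup = int((scope / "memory.current").read_text())

print(pss, rss, cgroup)  # 524288 1048576 524288
```

the key difference for runtime is that extract_pss forces the kernel to walk
the page tables of every process in the scope, while extract_cgroup is a
single counter read.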
first, with no KSM active
VMID: 113
pss: 724971520
rss: 733282304
cgroup: 727617536
rss_cgroup: 733282304
Benchmark 1: extract_pss 271.2 ms ± 6.0 ms [User: 226.3 ms, System: 44.7 ms]
Benchmark 2: extract_rss 267.8 ms ± 3.6 ms [User: 223.9 ms, System: 43.7 ms]
Benchmark 3: extract_cgroup 273.5 ms ± 6.2 ms [User: 227.2 ms, System: 46.2 ms]
Benchmark 4: extract_rss_cgroup 270.5 ms ± 3.7 ms [User: 225.0 ms, System: 45.3 ms]
both reported usage and runtime are in the same ballpark
VMID: 838383 (with 48G of memory):
pss: 40561564672
rss: 40566108160
cgroup: 40961339392
rss_cgroup: 40572141568
usage is in the same ballpark
Benchmark 1: extract_pss 732.0 ms ± 4.4 ms [User: 224.8 ms, System: 506.8 ms]
Benchmark 2: extract_rss 272.1 ms ± 5.2 ms [User: 227.8 ms, System: 44.0 ms]
Benchmark 3: extract_cgroup 274.2 ms ± 2.2 ms [User: 227.8 ms, System: 46.2 ms]
Benchmark 4: extract_rss_cgroup 270.9 ms ± 3.9 ms [User: 224.9 ms, System: 45.8 ms]
but PSS is already a lot slower.
Benchmark 1: qm_status_stock 820.9 ms ± 7.5 ms [User: 293.1 ms, System: 523.3 ms]
Benchmark 2: qm_status_patched 356.2 ms ± 5.6 ms [User: 290.2 ms, System: 61.5 ms]
which is also visible in the stock vs. patched `qm status` runs above.
the other two VMs behaved like 113.
and now with KSM active
VMID: 113
merged pages: 10747 (very little)
pss: 559815680
rss: 594853888
cgroup: 568197120
rss_cgroup: 594853888
Benchmark 1: extract_pss 280.0 ms ± 2.4 ms [User: 229.5 ms, System: 50.2 ms]
Benchmark 2: extract_rss 274.8 ms ± 3.7 ms [User: 225.9 ms, System: 48.7 ms]
Benchmark 3: extract_cgroup 279.0 ms ± 4.6 ms [User: 228.0 ms, System: 50.7 ms]
Benchmark 4: extract_rss_cgroup 274.7 ms ± 6.7 ms [User: 228.0 ms, System: 46.4 ms]
still in the same ballpark
VMID: 838383 (with 48G of memory)
merged pages: 6696434 (a lot - this is 25G worth of pages!)
pss: 12411169792
rss: 38772117504
cgroup: 12799062016
rss_cgroup: 38778150912
the RSS-based numbers are roughly the same as before, but the cgroup value is
almost identical to PSS despite KSM being active!
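the "25G worth of pages" figure above is a back-of-the-envelope conversion
assuming 4 KiB pages, and it also matches the gap between rss and pss:

```python
# merged ksm_merging_pages for VM 838383, times the assumed 4 KiB page size
merged_pages = 6696434
merged_bytes = merged_pages * 4096
print(round(merged_bytes / 2**30, 1))  # 25.5 (GiB)

# roughly the rss - pss gap reported for that VM
print(round((38772117504 - 12411169792) / 2**30, 1))  # 24.6 (GiB)
```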
Benchmark 1: extract_pss 691.7 ms ± 3.4 ms [User: 225.5 ms, System: 465.8 ms]
Benchmark 2: extract_rss 276.3 ms ± 7.1 ms [User: 227.4 ms, System: 48.6 ms]
Benchmark 3: extract_cgroup 277.8 ms ± 4.4 ms [User: 228.5 ms, System: 49.1 ms]
Benchmark 4: extract_rss_cgroup 274.7 ms ± 3.5 ms [User: 226.6 ms, System: 47.8 ms]
but it is still fast!
Benchmark 1: qm_status_stock 771.8 ms ± 7.2 ms [User: 296.0 ms, System: 471.0 ms]
Benchmark 2: qm_status_patched 360.2 ms ± 5.1 ms [User: 287.1 ms, System: 68.5 ms]
confirmed by `qm status` as well
VMID: 838384
merged pages: 165540 (little, this is about 645MB worth of pages)
pss: 2522527744
rss: 2927058944
cgroup: 2500329472
rss_cgroup: 2932944896
Benchmark 1: extract_pss 318.4 ms ± 3.6 ms [User: 227.3 ms, System: 90.8 ms]
Benchmark 2: extract_rss 273.9 ms ± 5.8 ms [User: 226.5 ms, System: 47.2 ms]
Benchmark 3: extract_cgroup 276.3 ms ± 4.1 ms [User: 225.4 ms, System: 50.7 ms]
Benchmark 4: extract_rss_cgroup 276.5 ms ± 8.6 ms [User: 226.1 ms, System: 50.1 ms]
Benchmark 1: qm_status_stock 400.2 ms ± 6.6 ms [User: 292.1 ms, System: 103.5 ms]
Benchmark 2: qm_status_patched 357.0 ms ± 4.1 ms [User: 288.7 ms, System: 63.7 ms]
results match those of 838383, just with less effect
the fourth VM matches this as well.
Fixes/Reverts: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
given the numbers, going with the CGroup-based approach seems best - it gives
us accurate numbers without the slowdown, and gives users an insight into how
KSM affects their guests' host memory usage without flip-flopping.
src/PVE/QemuServer.pm | 35 ++++-------------------------------
1 file changed, 4 insertions(+), 31 deletions(-)
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index a7fbec14..62d835a5 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2324,35 +2324,6 @@ sub vzlist {
return $vzlist;
}
-# Iterate over all PIDs inside a VMID's cgroup slice and accumulate their PSS (proportional set
-# size) to get a relatively telling effective memory usage of all processes involved with a VM.
-my sub get_vmid_total_cgroup_memory_usage {
- my ($vmid) = @_;
-
- my $memory_usage = 0;
- if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
- while (my $pid = <$procs_fh>) {
- chomp($pid);
-
- open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
- or $!{ENOENT}
- or die "failed to open PSS memory-stat from process - $!\n";
- next if !defined($smaps_fh);
-
- while (my $line = <$smaps_fh>) {
- if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
- $memory_usage += int($1) * 1024;
- last; # end inner while loop, go to next $pid
- }
- }
- close $smaps_fh;
- }
- close($procs_fh);
- }
-
- return $memory_usage;
-}
-
our $vmstatus_return_properties = {
vmid => get_standard_option('pve-vmid'),
status => {
@@ -2614,9 +2585,11 @@ sub vmstatus {
$d->{uptime} = int(($uptime - $pstat->{starttime}) / $cpuinfo->{user_hz});
- $d->{memhost} = get_vmid_total_cgroup_memory_usage($vmid);
+ my $cgroup = PVE::QemuServer::CGroup->new($vmid);
+ my $cgroup_mem = $cgroup->get_memory_stat();
+ $d->{memhost} = $cgroup_mem->{mem} // 0;
- $d->{mem} = $d->{memhost}; # default to cgroup PSS sum, balloon info can override this below
+ $d->{mem} = $d->{memhost}; # default to cgroup, balloon info can override this below
my $pressures = PVE::ProcFSTools::read_cgroup_pressure("qemu.slice/${vmid}.scope");
$d->{pressurecpusome} = $pressures->{cpu}->{some}->{avg10} * 1;
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel