From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>,
"Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Subject: Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
Date: Tue, 25 Nov 2025 15:08:43 +0100
Message-ID: <7fe6a9b5-3950-4195-9aaf-8112fcf68aa4@proxmox.com>
In-Reply-To: <20251125135107.561633-1-f.gruenbichler@proxmox.com>
Am 25.11.25 um 14:51 schrieb Fabian Grünbichler:
> After a certain amount of KSM sharing, PSS lookups become prohibitively
> expensive. Fall back to RSS (which was used before) in that case, to avoid
> vmstatus calls blocking for long periods of time.
>
> I benchmarked this with 3 VMs running with different levels of KSM sharing. In
> the output below, "merged pages" refers to the contents of
> /proc/$pid/ksm_merging_pages, extract_pss is the parsing code for the
> cumulative PSS of a VM cgroup run in isolation, extract_rss is the same for
> the cumulative RSS, qm_status_stock is `qm status $vmid --verbose`, and
> qm_status_patched is `perl -I./src/PVE ./src/bin/qm status $vmid --verbose`
> with this patch applied.
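For reference, the isolated "extract_pss" measurement presumably boils down to
the PID loop from get_vmid_total_cgroup_memory_usage(); a rough standalone
sketch of that (the actual benchmark script is not included in the mail, so
details like the VMID argument handling are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::File;

    my $vmid = shift // die "usage: $0 <vmid>\n";
    my $usage = 0;

    my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")
        or die "failed to open cgroup.procs - $!\n";
    while (my $pid = <$procs_fh>) {
        chomp($pid);
        # use 'status' and /^VmRSS:/ instead to get the extract_rss variant
        open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup") or next;
        while (my $line = <$smaps_fh>) {
            if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
                $usage += int($1) * 1024;
                last; # only one Pss line per rollup, go to next PID
            }
        }
        close($smaps_fh);
    }
    close($procs_fh);
    print "cumulative PSS: $usage bytes\n";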
>
> first, a VM with barely any sharing:
>
> merged pages: 1574
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 15.0 ms ± 0.6 ms [User: 4.2 ms, System: 10.8 ms]
> Range (min … max): 14.1 ms … 17.0 ms 173 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 5.8 ms ± 0.3 ms [User: 4.3 ms, System: 1.5 ms]
> Range (min … max): 5.3 ms … 7.7 ms 466 runs
>
> Summary
> extract_rss ran
> 2.56 ± 0.16 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 363.5 ms ± 5.6 ms [User: 290.8 ms, System: 68.5 ms]
> Range (min … max): 353.1 ms … 370.4 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 360.6 ms ± 4.2 ms [User: 285.4 ms, System: 71.0 ms]
> Range (min … max): 355.0 ms … 366.5 ms 10 runs
>
> Summary
> qm_status_patched ran
> 1.01 ± 0.02 times faster than qm_status_stock
>
> This shows very little difference in the total status runtime.
>
> next, a VM with modest sharing:
>
> merged pages: 52118
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 57.1 ms ± 1.3 ms [User: 4.3 ms, System: 52.8 ms]
> Range (min … max): 54.6 ms … 60.4 ms 50 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.3 ms, System: 1.6 ms]
> Range (min … max): 5.4 ms … 6.9 ms 464 runs
>
> Summary
> extract_rss ran
> 9.60 ± 0.52 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 407.9 ms ± 5.9 ms [User: 288.2 ms, System: 115.0 ms]
> Range (min … max): 402.2 ms … 419.3 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 412.9 ms ± 7.6 ms [User: 294.4 ms, System: 113.9 ms]
> Range (min … max): 405.9 ms … 425.8 ms 10 runs
>
> Summary
> qm_status_stock ran
> 1.01 ± 0.02 times faster than qm_status_patched
>
> While the stat extraction alone would be a lot faster via RSS, the total status
> runtime is still noticeably larger (the patched `qm status` will still use PSS
> in this case, as the VM is below the threshold!).
>
> and now a VM with the problematic behaviour caused by lots of sharing (~12GB):
>
> merged pages: 3095741
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 583.2 ms ± 4.6 ms [User: 3.9 ms, System: 579.1 ms]
> Range (min … max): 573.9 ms … 591.7 ms 10 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.2 ms, System: 1.7 ms]
> Range (min … max): 5.4 ms … 7.3 ms 412 runs
>
> Summary
> extract_rss ran
> 97.66 ± 5.00 times faster than extract_pss
>
> extraction via PSS alone is now slower than the whole status call with RSS:
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 935.5 ms ± 8.4 ms [User: 292.2 ms, System: 638.6 ms]
> Range (min … max): 924.8 ms … 952.0 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 359.9 ms ± 7.6 ms [User: 295.1 ms, System: 60.3 ms]
> Range (min … max): 350.1 ms … 371.3 ms 10 runs
>
> Summary
> qm_status_patched ran
> 2.60 ± 0.06 times faster than qm_status_stock
>
> Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>
> Notes:
> The threshold is a bit arbitrary; we could also consider setting it
> lower to be on the safe side, or making it relative to the total
> number of pages of memory.
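A relative variant could look roughly like the following sketch (the 10% cutoff
and deriving the page count from MemTotal in /proc/meminfo are just assumptions
for illustration):

    # hypothetical: derive the cutoff from total memory instead of a fixed 1 GiB
    my $meminfo = PVE::Tools::file_read_firstline("/proc/meminfo") // ''; # "MemTotal: NNN kB"
    my ($mem_total_kb) = $meminfo =~ m/^MemTotal:\s+([0-9]+) kB/;
    my $total_pages = int($mem_total_kb / 4); # assumes 4 KiB pages
    my $ksm_page_threshold = int($total_pages / 10); # e.g. 10% of all pages

    if ($ksm_pages && $ksm_pages > $ksm_page_threshold) {
        # fall back to RSS as in the patch below
    }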
>
> One issue with this approach is that if KSM is disabled later on and
> all the merging is undone, the problematic behaviour remains, and
> there is - AFAICT - no trace of this state in the process's `ksm_stat`
> or elsewhere. The behaviour goes away if the VM is stopped and
> started again. Instead of making a per-PID decision, we might want
> to opt for a global RSS fallback in case KSM is detected as active
> on the host?
One can now also disable KSM per VM, so that config property should be
checked too if we go that route.
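FWIW, the host-global part of such a check could be as simple as something like
the following sketch; whether these sysfs files are the right indicator, and how
to combine this with the per-VM property, would still need to be worked out, and
as noted above neither file would reflect the "merging already undone" case:

    # sketch: fall back to RSS host-wide while KSM is enabled or pages are
    # currently merged on the host
    my $ksm_run = PVE::Tools::file_read_firstline("/sys/kernel/mm/ksm/run") // 0;
    my $ksm_sharing = PVE::Tools::file_read_firstline("/sys/kernel/mm/ksm/pages_sharing") // 0;
    my $prefer_rss = $ksm_run != 0 || $ksm_sharing > 0;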
>
> We should of course also investigate further whether this is fixable
> or improvable on the kernel side.
>
> src/PVE/QemuServer.pm | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index a7fbec14..82e9c004 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
> if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
Just to be sure: the stats from memory.current or memory.stat inside the
/sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory are definitely not
enough for our use cases?
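That would reduce the whole lookup to a single read per VM, roughly like this
sketch (assuming cgroup v2, where memory.current is a byte count):

    # sketch: single cgroup-level read instead of iterating over all PIDs
    my $current = PVE::Tools::file_read_firstline(
        "/sys/fs/cgroup/qemu.slice/${vmid}.scope/memory.current");
    my $memory_usage = defined($current) ? int($current) : 0; # bytes, incl. charged page cache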
> while (my $pid = <$procs_fh>) {
> chomp($pid);
> + my $filename = 'smaps_rollup';
> + my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
>
> - open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
> + my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
> + # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
> + if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
Hmm, this can lead to sudden "jumps" in the RRD metrics data, but that's rather
independent of the exact decision expression and is always the case whenever we
switch between the two. Dropping this stat completely again could also be an
option. A middle ground could be to only show it in the live view, using a
heuristic such as the one proposed here.
> + $filename = 'status';
> + $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
> + }
> +
> + open(my $pid_fh, '<', "/proc/${pid}/${filename}")
> or $!{ENOENT}
> - or die "failed to open PSS memory-stat from process - $!\n";
> - next if !defined($smaps_fh);
> + or die "failed to open /proc/${pid}/${filename} - $!\n";
> + next if !defined($pid_fh);
>
> - while (my $line = <$smaps_fh>) {
> - if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
> + while (my $line = <$pid_fh>) {
> + if ($line =~ $extract_usage_re) {
> $memory_usage += int($1) * 1024;
> last; # end inner while loop, go to next $pid
> }
> }
> - close $smaps_fh;
> + close $pid_fh;
> }
> close($procs_fh);
> }