From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: [pve-devel] superseded: [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
Date: Fri, 28 Nov 2025 11:37:08 +0100
Message-ID: <1764326216.apunwxgong.astroid@yuna.none>
In-Reply-To: <20251125135107.561633-1-f.gruenbichler@proxmox.com>
https://lore.proxmox.com/pve-devel/20251128103639.446372-1-f.gruenbichler@proxmox.com/
On November 25, 2025 2:51 pm, Fabian Grünbichler wrote:
> after a certain amount of KSM sharing, PSS lookups become prohibitively
> expensive. fall back to RSS (which was used before) in that case, to avoid
> vmstatus calls blocking for long periods of time.
>
> I benchmarked this with 3 VMs running with different levels of KSM sharing. in
> the output below, "merged pages" refers to the contents of
> /proc/$pid/ksm_merging_pages, extract_pss is the parsing code for cumulative
> PSS of a VM cgroup isolated, extract_rss is the parsing code for cumulative RSS
> of a VM cgroup isolated, qm_status_stock is `qm status $vmid --verbose`, and
> qm_status_patched is `perl -I./src/PVE ./src/bin/qm status $vmid --verbose`
> with this patch applied.
>
> first, a VM with barely any sharing:
>
> merged pages: 1574
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 15.0 ms ± 0.6 ms [User: 4.2 ms, System: 10.8 ms]
> Range (min … max): 14.1 ms … 17.0 ms 173 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 5.8 ms ± 0.3 ms [User: 4.3 ms, System: 1.5 ms]
> Range (min … max): 5.3 ms … 7.7 ms 466 runs
>
> Summary
> extract_rss ran
> 2.56 ± 0.16 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 363.5 ms ± 5.6 ms [User: 290.8 ms, System: 68.5 ms]
> Range (min … max): 353.1 ms … 370.4 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 360.6 ms ± 4.2 ms [User: 285.4 ms, System: 71.0 ms]
> Range (min … max): 355.0 ms … 366.5 ms 10 runs
>
> Summary
> qm_status_patched ran
> 1.01 ± 0.02 times faster than qm_status_stock
>
> shows very little difference in total status runtime.
>
> next, a VM with modest sharing:
>
> merged pages: 52118
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 57.1 ms ± 1.3 ms [User: 4.3 ms, System: 52.8 ms]
> Range (min … max): 54.6 ms … 60.4 ms 50 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.3 ms, System: 1.6 ms]
> Range (min … max): 5.4 ms … 6.9 ms 464 runs
>
> Summary
> extract_rss ran
> 9.60 ± 0.52 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 407.9 ms ± 5.9 ms [User: 288.2 ms, System: 115.0 ms]
> Range (min … max): 402.2 ms … 419.3 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 412.9 ms ± 7.6 ms [User: 294.4 ms, System: 113.9 ms]
> Range (min … max): 405.9 ms … 425.8 ms 10 runs
>
> Summary
> qm_status_stock ran
> 1.01 ± 0.02 times faster than qm_status_patched
>
> while the stat extraction alone would be a lot faster via RSS, the total status
> runtime is still noticeably higher than in the first case (the patched
> `qm status` still uses PSS here, since the threshold is not reached!).
>
> and now a VM with the problematic behaviour caused by lots of sharing (~12GB):
>
> merged pages: 3095741
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 583.2 ms ± 4.6 ms [User: 3.9 ms, System: 579.1 ms]
> Range (min … max): 573.9 ms … 591.7 ms 10 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.2 ms, System: 1.7 ms]
> Range (min … max): 5.4 ms … 7.3 ms 412 runs
>
> Summary
> extract_rss ran
> 97.66 ± 5.00 times faster than extract_pss
>
> extraction via PSS alone is now slower than the whole status call with RSS:
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 935.5 ms ± 8.4 ms [User: 292.2 ms, System: 638.6 ms]
> Range (min … max): 924.8 ms … 952.0 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 359.9 ms ± 7.6 ms [User: 295.1 ms, System: 60.3 ms]
> Range (min … max): 350.1 ms … 371.3 ms 10 runs
>
> Summary
> qm_status_patched ran
> 2.60 ± 0.06 times faster than qm_status_stock
>
> Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>
> Notes:
> the threshold is a bit arbitrary; we could also consider setting it
> lower to be on the safe side, or making it relative to the total
> number of pages of memory.
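The patch's constant works out as follows (a sketch; the 4 KiB page size is an
assumption that matches the patch's division by 4, and `should_fall_back` is a
made-up name):

```python
PAGE_SIZE = 4096           # bytes; the x86-64 default, an assumption here
THRESHOLD_BYTES = 1 << 30  # 1 GiB

def should_fall_back(ksm_merging_pages: int) -> bool:
    """Mirror the patch's check: more than 1 GiB merged -> use RSS."""
    return ksm_merging_pages * PAGE_SIZE > THRESHOLD_BYTES

# the patch's `1024 * 1024 / 4` is 262144 pages, i.e. exactly 1 GiB
```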
>
> one issue with this approach is that if KSM is disabled later on and
> all the merging is undone, the problematic behaviour remains, and
> there is - AFAICT - no trace of this state in `ksm_stat` of the
> process or elsewhere. the behaviour goes away if the VM is stopped
> and started again. instead of doing a per-pid decision, we might
> want to opt for setting a global RSS fallback in case KSM is
> detected as active on the host?
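The global alternative could look roughly like this (`ksm_active` is a
hypothetical name; the sysfs files are the standard KSM control interface,
and the path is parameterized only so the sketch can be exercised):

```python
from pathlib import Path

def ksm_active(sysfs="/sys/kernel/mm/ksm"):
    """Return True if KSM is running and has merged any pages on this host."""
    try:
        run = int(Path(sysfs, "run").read_text())            # 1 = merging
        sharing = int(Path(sysfs, "pages_sharing").read_text())
    except (OSError, ValueError):
        return False
    return run == 1 and sharing > 0
```

Note that this would also not catch the "KSM was active earlier but is now
disabled and unmerged" case described above.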
>
> we should of course also investigate further whether this is fixable
> or improvable on the kernel side.
>
> src/PVE/QemuServer.pm | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index a7fbec14..82e9c004 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
>      if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
>          while (my $pid = <$procs_fh>) {
>              chomp($pid);
> +            my $filename = 'smaps_rollup';
> +            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
>  
> -            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
> +            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
> +            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
> +            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
> +                $filename = 'status';
> +                $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
> +            }
> +
> +            open(my $pid_fh, '<', "/proc/${pid}/${filename}")
>                  or $!{ENOENT}
> -                or die "failed to open PSS memory-stat from process - $!\n";
> -            next if !defined($smaps_fh);
> +                or die "failed to open /proc/${pid}/${filename} - $!\n";
> +            next if !defined($pid_fh);
>  
> -            while (my $line = <$smaps_fh>) {
> -                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
> +            while (my $line = <$pid_fh>) {
> +                if ($line =~ $extract_usage_re) {
>                      $memory_usage += int($1) * 1024;
>                      last; # end inner while loop, go to next $pid
>                  }
>              }
> -            close $smaps_fh;
> +            close $pid_fh;
>          }
>          close($procs_fh);
>      }
> --
> 2.47.3
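For readers who don't speak Perl, the patched per-PID logic above can be
sketched in Python (all names are hypothetical; `proc_root` is parameterized
only so the sketch can be tested against a fake /proc tree):

```python
import re

KSM_FALLBACK_PAGES = 1024 * 1024 // 4  # > 1 GiB of 4 KiB merged pages

def pid_memory_bytes(pid, proc_root="/proc"):
    """Memory usage of one process: PSS normally, RSS under heavy KSM."""
    filename, regex = "smaps_rollup", re.compile(r"^Pss:\s+([0-9]+) kB$")
    try:
        with open(f"{proc_root}/{pid}/ksm_merging_pages") as fh:
            if int(fh.read()) > KSM_FALLBACK_PAGES:
                filename, regex = "status", re.compile(r"^VmRSS:\s+([0-9]+) kB$")
    except (OSError, ValueError):
        pass  # no KSM info available, keep the PSS path
    try:
        with open(f"{proc_root}/{pid}/{filename}") as fh:
            for line in fh:
                m = regex.match(line)
                if m:
                    return int(m.group(1)) * 1024
    except FileNotFoundError:
        pass  # process exited between listing and reading, skip it
    return 0
```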
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>