From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: [pve-devel] superseded: [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
Date: Fri, 28 Nov 2025 11:37:08 +0100
Message-ID: <1764326216.apunwxgong.astroid@yuna.none>
In-Reply-To: <20251125135107.561633-1-f.gruenbichler@proxmox.com>
https://lore.proxmox.com/pve-devel/20251128103639.446372-1-f.gruenbichler@proxmox.com/
On November 25, 2025 2:51 pm, Fabian Grünbichler wrote:
> after a certain amount of KSM sharing, PSS lookups become prohibitively
> expensive. fall back to RSS (which was used before) in that case, to avoid
> vmstatus calls blocking for long periods of time.
>
> I benchmarked this with 3 VMs running with different levels of KSM sharing. in
> the output below, "merged pages" refers to the contents of
> /proc/$pid/ksm_merging_pages, extract_pss is the parsing code for cumulative
> PSS of a VM cgroup isolated, extract_rss is the parsing code for cumulative RSS
> of a VM cgroup isolated, qm_status_stock is `qm status $vmid --verbose`, and
> qm_status_patched is `perl -I./src/PVE ./src/bin/qm status $vmid --verbose`
> with this patch applied.
>
> first, a VM with barely any sharing:
>
> merged pages: 1574
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 15.0 ms ± 0.6 ms [User: 4.2 ms, System: 10.8 ms]
> Range (min … max): 14.1 ms … 17.0 ms 173 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 5.8 ms ± 0.3 ms [User: 4.3 ms, System: 1.5 ms]
> Range (min … max): 5.3 ms … 7.7 ms 466 runs
>
> Summary
> extract_rss ran
> 2.56 ± 0.16 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 363.5 ms ± 5.6 ms [User: 290.8 ms, System: 68.5 ms]
> Range (min … max): 353.1 ms … 370.4 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 360.6 ms ± 4.2 ms [User: 285.4 ms, System: 71.0 ms]
> Range (min … max): 355.0 ms … 366.5 ms 10 runs
>
> Summary
> qm_status_patched ran
> 1.01 ± 0.02 times faster than qm_status_stock
>
> shows very little difference in total status runtime.
>
> next, a VM with modest sharing:
>
> merged pages: 52118
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 57.1 ms ± 1.3 ms [User: 4.3 ms, System: 52.8 ms]
> Range (min … max): 54.6 ms … 60.4 ms 50 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.3 ms, System: 1.6 ms]
> Range (min … max): 5.4 ms … 6.9 ms 464 runs
>
> Summary
> extract_rss ran
> 9.60 ± 0.52 times faster than extract_pss
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 407.9 ms ± 5.9 ms [User: 288.2 ms, System: 115.0 ms]
> Range (min … max): 402.2 ms … 419.3 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 412.9 ms ± 7.6 ms [User: 294.4 ms, System: 113.9 ms]
> Range (min … max): 405.9 ms … 425.8 ms 10 runs
>
> Summary
> qm_status_stock ran
> 1.01 ± 0.02 times faster than qm_status_patched
>
> while the stat extraction alone would be a lot faster via RSS, the total status
> runtime is still noticeably higher than in the first case (the patched
> `qm status` still uses PSS here, since the threshold is not reached!).
>
> and now a VM with the problematic behaviour caused by lots of sharing (~12GB):
>
> merged pages: 3095741
>
> Benchmark 1: extract_pss
> Time (mean ± σ): 583.2 ms ± 4.6 ms [User: 3.9 ms, System: 579.1 ms]
> Range (min … max): 573.9 ms … 591.7 ms 10 runs
>
> Benchmark 2: extract_rss
> Time (mean ± σ): 6.0 ms ± 0.3 ms [User: 4.2 ms, System: 1.7 ms]
> Range (min … max): 5.4 ms … 7.3 ms 412 runs
>
> Summary
> extract_rss ran
> 97.66 ± 5.00 times faster than extract_pss
>
> extraction via PSS alone is now slower than the whole status call with RSS:
>
> Benchmark 1: qm_status_stock
> Time (mean ± σ): 935.5 ms ± 8.4 ms [User: 292.2 ms, System: 638.6 ms]
> Range (min … max): 924.8 ms … 952.0 ms 10 runs
>
> Benchmark 2: qm_status_patched
> Time (mean ± σ): 359.9 ms ± 7.6 ms [User: 295.1 ms, System: 60.3 ms]
> Range (min … max): 350.1 ms … 371.3 ms 10 runs
>
> Summary
> qm_status_patched ran
> 2.60 ± 0.06 times faster than qm_status_stock
>
> Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>
> Notes:
> the threshold is a bit arbitrary; we could also consider setting it
> lower to be on the safe side, or making it relative to the total
> number of pages of memory.
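The patch's constant works out as follows (a sketch; the 4 KiB page size is an
assumption that matches the patch's division by 4, and `should_fall_back` is a
made-up name):

```python
PAGE_SIZE = 4096           # bytes; the x86-64 default, an assumption here
THRESHOLD_BYTES = 1 << 30  # 1 GiB

def should_fall_back(ksm_merging_pages: int) -> bool:
    """Mirror the patch's check: more than 1 GiB merged -> use RSS."""
    return ksm_merging_pages * PAGE_SIZE > THRESHOLD_BYTES

# the patch's `1024 * 1024 / 4` is 262144 pages, i.e. exactly 1 GiB
```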
>
> one issue with this approach is that if KSM is disabled later on and
> all the merging is undone, the problematic behaviour remains, and
> there is - AFAICT - no trace of this state in `ksm_stat` of the
> process or elsewhere. the behaviour goes away if the VM is stopped
> and started again. instead of doing a per-pid decision, we might
> want to opt for setting a global RSS fallback in case KSM is
> detected as active on the host?
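The global alternative could look roughly like this (`ksm_active` is a
hypothetical name; the sysfs files are the standard KSM control interface,
and the path is parameterized only so the sketch can be exercised):

```python
from pathlib import Path

def ksm_active(sysfs="/sys/kernel/mm/ksm"):
    """Return True if KSM is running and has merged any pages on this host."""
    try:
        run = int(Path(sysfs, "run").read_text())            # 1 = merging
        sharing = int(Path(sysfs, "pages_sharing").read_text())
    except (OSError, ValueError):
        return False
    return run == 1 and sharing > 0
```

Note that this would also not catch the "KSM was active earlier but is now
disabled and unmerged" case described above.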
>
> we should of course also investigate further whether this is fixable
> or improvable on the kernel side.
>
> src/PVE/QemuServer.pm | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index a7fbec14..82e9c004 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
>      if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
>          while (my $pid = <$procs_fh>) {
>              chomp($pid);
> +            my $filename = 'smaps_rollup';
> +            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
>  
> -            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
> +            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
> +            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
> +            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
> +                $filename = 'status';
> +                $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
> +            }
> +
> +            open(my $pid_fh, '<', "/proc/${pid}/${filename}")
>                  or $!{ENOENT}
> -                or die "failed to open PSS memory-stat from process - $!\n";
> -            next if !defined($smaps_fh);
> +                or die "failed to open /proc/${pid}/${filename} - $!\n";
> +            next if !defined($pid_fh);
>  
> -            while (my $line = <$smaps_fh>) {
> -                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
> +            while (my $line = <$pid_fh>) {
> +                if ($line =~ $extract_usage_re) {
>                      $memory_usage += int($1) * 1024;
>                      last; # end inner while loop, go to next $pid
>                  }
>              }
> -            close $smaps_fh;
> +            close $pid_fh;
>          }
>          close($procs_fh);
>      }
> --
> 2.47.3
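For readers who don't speak Perl, the patched per-PID logic above can be
sketched in Python (all names are hypothetical; `proc_root` is parameterized
only so the sketch can be tested against a fake /proc tree):

```python
import re

KSM_FALLBACK_PAGES = 1024 * 1024 // 4  # > 1 GiB of 4 KiB merged pages

def pid_memory_bytes(pid, proc_root="/proc"):
    """Memory usage of one process: PSS normally, RSS under heavy KSM."""
    filename, regex = "smaps_rollup", re.compile(r"^Pss:\s+([0-9]+) kB$")
    try:
        with open(f"{proc_root}/{pid}/ksm_merging_pages") as fh:
            if int(fh.read()) > KSM_FALLBACK_PAGES:
                filename, regex = "status", re.compile(r"^VmRSS:\s+([0-9]+) kB$")
    except (OSError, ValueError):
        pass  # no KSM info available, keep the PSS path
    try:
        with open(f"{proc_root}/{pid}/{filename}") as fh:
            for line in fh:
                m = regex.match(line)
                if m:
                    return int(m.group(1)) * 1024
    except FileNotFoundError:
        pass  # process exited between listing and reading, skip it
    return 0
```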
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>