From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH qemu-server] fix #6935: vmstatus: use CGroup for host memory usage
Date: Wed, 17 Dec 2025 10:36:34 +0100
Message-ID: <1765964137.srayifogt6.astroid@yuna.none>
In-Reply-To: <20251128103639.446372-1-f.gruenbichler@proxmox.com>

ping - we get semi-regular reports of people running into this since
upgrading.

On November 28, 2025 11:36 am, Fabian Grünbichler wrote:
> after a certain amount of KSM sharing, PSS lookups become prohibitively
> expensive. instead of reverting to the old broken method, simply use the
> cgroup's memory usage as the `memhost` value.
> 
> this no longer accounts for pages merged by KSM.
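> 
> for reference, what the patched extraction amounts to is roughly the following
> (a minimal sketch, assuming a cgroup v2 host where the VM's processes live in
> qemu.slice/${vmid}.scope, and with the `// 0` fallback mirroring the diff
> below):
> 
>     use PVE::QemuServer::CGroup;
> 
>     # query the memory controller of the VM's cgroup instead of summing
>     # per-process PSS from /proc/$pid/smaps_rollup
>     my $cgroup = PVE::QemuServer::CGroup->new($vmid);
>     my $cgroup_mem = $cgroup->get_memory_stat();
>     my $memhost = $cgroup_mem->{mem} // 0; # bytes of host memory charged to the cgroup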
> 
> I benchmarked this with 4 VMs running with different levels of KSM sharing. in
> the output below, "merged pages" refers to the contents of
> /proc/$pid/ksm_merging_pages. the extract_* benchmark runs refer to four
> different variants of extracting memory usage, with the actual extraction part
> running 1000x in a loop for each run to amortize perl/process setup costs.
> qm_status_stock is `qm status $vmid --verbose`, and qm_status_patched is `perl
> -I./src/PVE ./src/bin/qm status $vmid --verbose` with this patch applied.
> 
> the variants:
> - extract_pss: status before this patch, query smaps_rollup for each process
>   that is part of the qemu.slice of the VM
> - extract_rss: extract VmRSS from the `/proc/$pid/status` file of the main
>   process (a minimal sketch of this variant and the benchmark loop follows
>   after this list)
> - extract_rss_cgroup: like _rss, but for each process of the slice
> - extract_cgroup: use PVE::QemuServer::CGroup's get_memory_stat (this patch)
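> 
> as an illustration, the _rss variant and the benchmark loop look roughly like
> this (a hypothetical sketch - the `extract_rss` helper name and the plain $pid
> handling are illustrative, not the exact harness used):
> 
>     # read VmRSS (resident set size) of the main QEMU process from procfs
>     sub extract_rss {
>         my ($pid) = @_;
>         my $rss = 0;
>         if (open(my $fh, '<', "/proc/$pid/status")) {
>             while (my $line = <$fh>) {
>                 if ($line =~ m/^VmRSS:\s+(\d+)\s+kB/) {
>                     $rss = $1 * 1024; # VmRSS is reported in kB, convert to bytes
>                     last;
>                 }
>             }
>             close($fh);
>         }
>         return $rss;
>     }
> 
>     my ($pid) = @ARGV; # PID of the VM's main QEMU process
> 
>     # run the extraction 1000x in a loop to amortize perl/process setup costs
>     my $usage;
>     $usage = extract_rss($pid) for (1 .. 1000);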
> 
> first, with no KSM active
> 
> VMID: 113
> 
> pss:        724971520
> rss:        733282304
> cgroup:     727617536
> rss_cgroup: 733282304
> 
> Benchmark 1: extract_pss        271.2 ms ±   6.0 ms    [User: 226.3 ms, System: 44.7 ms]
> Benchmark 2: extract_rss        267.8 ms ±   3.6 ms    [User: 223.9 ms, System: 43.7 ms]
> Benchmark 3: extract_cgroup     273.5 ms ±   6.2 ms    [User: 227.2 ms, System: 46.2 ms]
> Benchmark 4: extract_rss_cgroup 270.5 ms ±   3.7 ms    [User: 225.0 ms, System: 45.3 ms]
> 
> both reported usage and runtime are in the same ballpark
> 
> VMID: 838383 (with 48G of memory):
> 
> pss:        40561564672
> rss:        40566108160
> cgroup:     40961339392
> rss_cgroup: 40572141568
> 
> usage is in the same ballpark
> 
> Benchmark 1: extract_pss        732.0 ms ±   4.4 ms    [User: 224.8 ms, System: 506.8 ms]
> Benchmark 2: extract_rss        272.1 ms ±   5.2 ms    [User: 227.8 ms, System: 44.0 ms]
> Benchmark 3: extract_cgroup     274.2 ms ±   2.2 ms    [User: 227.8 ms, System: 46.2 ms]
> Benchmark 4: extract_rss_cgroup 270.9 ms ±   3.9 ms    [User: 224.9 ms, System: 45.8 ms]
> 
> but PSS is already a lot slower.
> 
> Benchmark 1: qm_status_stock   820.9 ms ±   7.5 ms    [User: 293.1 ms, System: 523.3 ms]
> Benchmark 2: qm_status_patched 356.2 ms ±   5.6 ms    [User: 290.2 ms, System: 61.5 ms]
> 
> which is also visible in the before-and-after `qm status` runs
> 
> the other two VMs behaved like 113
> 
> and now with KSM active
> 
> VMID: 113
> merged pages: 10747 (very little)
> 
> pss:        559815680
> rss:        594853888
> cgroup:     568197120
> rss_cgroup: 594853888
> 
> Benchmark 1: extract_pss        280.0 ms ±   2.4 ms    [User: 229.5 ms, System: 50.2 ms]
> Benchmark 2: extract_rss        274.8 ms ±   3.7 ms    [User: 225.9 ms, System: 48.7 ms]
> Benchmark 3: extract_cgroup     279.0 ms ±   4.6 ms    [User: 228.0 ms, System: 50.7 ms]
> Benchmark 4: extract_rss_cgroup 274.7 ms ±   6.7 ms    [User: 228.0 ms, System: 46.4 ms]
> 
> still in the same ballpark
> 
> VMID: 838383 (with 48G of memory)
> merged pages: 6696434 (a lot - this is 25G worth of pages!)
> 
> pss:        12411169792
> rss:        38772117504
> cgroup:     12799062016
> rss_cgroup: 38778150912
> 
> the RSS-based values are roughly the same, but cgroup gives us almost the same
> numbers as PSS despite KSM being active!
> 
> Benchmark 1: extract_pss        691.7 ms ±   3.4 ms    [User: 225.5 ms, System: 465.8 ms]
> Benchmark 2: extract_rss        276.3 ms ±   7.1 ms    [User: 227.4 ms, System: 48.6 ms]
> Benchmark 3: extract_cgroup     277.8 ms ±   4.4 ms    [User: 228.5 ms, System: 49.1 ms]
> Benchmark 4: extract_rss_cgroup 274.7 ms ±   3.5 ms    [User: 226.6 ms, System: 47.8 ms]
> 
> but it is still fast!
> 
> Benchmark 1: qm_status_stock   771.8 ms ±   7.2 ms    [User: 296.0 ms, System: 471.0 ms]
> Benchmark 2: qm_status_patched 360.2 ms ±   5.1 ms    [User: 287.1 ms, System: 68.5 ms]
> 
> confirmed by `qm status` as well
> 
> VMID: 838384
> merged pages: 165540 (little, this is about 645MB worth of pages)
> 
> pss:        2522527744
> rss:        2927058944
> cgroup:     2500329472
> rss_cgroup: 2932944896
> 
> Benchmark 1: extract_pss        318.4 ms ±   3.6 ms    [User: 227.3 ms, System: 90.8 ms]
> Benchmark 2: extract_rss        273.9 ms ±   5.8 ms    [User: 226.5 ms, System: 47.2 ms]
> Benchmark 3: extract_cgroup     276.3 ms ±   4.1 ms    [User: 225.4 ms, System: 50.7 ms]
> Benchmark 4: extract_rss_cgroup 276.5 ms ±   8.6 ms    [User: 226.1 ms, System: 50.1 ms]
> 
> Benchmark 1: qm_status_stock   400.2 ms ±   6.6 ms    [User: 292.1 ms, System: 103.5 ms]
> Benchmark 2: qm_status_patched 357.0 ms ±   4.1 ms    [User: 288.7 ms, System: 63.7 ms]
> 
> results match those of 838383, just with a smaller effect
> 
> the fourth VM matches this as well.
> 
> Fixes/Reverts: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> 
> Notes:
>     given the numbers, going with the CGroup-based approach seems best - it gives
>     us accurate numbers without the slowdown, and gives users insight into how
>     KSM affects their guests' host memory usage without flip-flopping.
> 
>  src/PVE/QemuServer.pm | 35 ++++-------------------------------
>  1 file changed, 4 insertions(+), 31 deletions(-)
> 
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index a7fbec14..62d835a5 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -2324,35 +2324,6 @@ sub vzlist {
>      return $vzlist;
>  }
>  
> -# Iterate over all PIDs inside a VMID's cgroup slice and accumulate their PSS (proportional set
> -# size) to get a relatively telling effective memory usage of all processes involved with a VM.
> -my sub get_vmid_total_cgroup_memory_usage {
> -    my ($vmid) = @_;
> -
> -    my $memory_usage = 0;
> -    if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
> -        while (my $pid = <$procs_fh>) {
> -            chomp($pid);
> -
> -            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
> -                or $!{ENOENT}
> -                or die "failed to open PSS memory-stat from process - $!\n";
> -            next if !defined($smaps_fh);
> -
> -            while (my $line = <$smaps_fh>) {
> -                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
> -                    $memory_usage += int($1) * 1024;
> -                    last; # end inner while loop, go to next $pid
> -                }
> -            }
> -            close $smaps_fh;
> -        }
> -        close($procs_fh);
> -    }
> -
> -    return $memory_usage;
> -}
> -
>  our $vmstatus_return_properties = {
>      vmid => get_standard_option('pve-vmid'),
>      status => {
> @@ -2614,9 +2585,11 @@ sub vmstatus {
>  
>          $d->{uptime} = int(($uptime - $pstat->{starttime}) / $cpuinfo->{user_hz});
>  
> -        $d->{memhost} = get_vmid_total_cgroup_memory_usage($vmid);
> +        my $cgroup = PVE::QemuServer::CGroup->new($vmid);
> +        my $cgroup_mem = $cgroup->get_memory_stat();
> +        $d->{memhost} = $cgroup_mem->{mem} // 0;
>  
> -        $d->{mem} = $d->{memhost}; # default to cgroup PSS sum, balloon info can override this below
> +        $d->{mem} = $d->{memhost}; # default to cgroup, balloon info can override this below
>  
>          my $pressures = PVE::ProcFSTools::read_cgroup_pressure("qemu.slice/${vmid}.scope");
>          $d->{pressurecpusome} = $pressures->{cpu}->{some}->{avg10} * 1;
> -- 
> 2.47.3