public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
@ 2025-11-25 13:51 Fabian Grünbichler
From: Fabian Grünbichler @ 2025-11-25 13:51 UTC (permalink / raw)
  To: pve-devel

after a certain amount of KSM sharing, PSS lookups become prohibitively
expensive. fall back to RSS (which was used before) in that case, to avoid
vmstatus calls blocking for long periods of time.

I benchmarked this with 3 VMs running with different levels of KSM sharing. in
the output below, "merged pages" refers to the contents of
/proc/$pid/ksm_merging_pages, extract_pss is the parsing code for the
cumulative PSS of a VM cgroup benchmarked in isolation, extract_rss is the
parsing code for the cumulative RSS of a VM cgroup benchmarked in isolation,
qm_status_stock is `qm status $vmid --verbose`, and qm_status_patched is
`perl -I./src/PVE ./src/bin/qm status $vmid --verbose` with this patch
applied.
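
for completeness, the isolated extraction corresponds roughly to the
following (simplified sketch, not the exact benchmark script):

    #!/usr/bin/perl
    # sum PSS or RSS over all processes of one VM cgroup, mirroring the
    # extraction logic in get_vmid_total_cgroup_memory_usage
    use strict;
    use warnings;

    my ($vmid, $mode) = @ARGV; # $mode is 'pss' or 'rss'
    my $file = $mode eq 'pss' ? 'smaps_rollup' : 'status';
    my $re = $mode eq 'pss' ? qr/^Pss:\s+([0-9]+) kB$/ : qr/^VmRSS:\s+([0-9]+) kB$/;

    my $total = 0;
    open(my $procs_fh, '<', "/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs")
        or die "failed to open cgroup.procs - $!\n";
    while (my $pid = <$procs_fh>) {
        chomp($pid);
        open(my $fh, '<', "/proc/${pid}/${file}") or next;
        while (my $line = <$fh>) {
            if ($line =~ $re) {
                $total += int($1) * 1024;
                last;
            }
        }
        close($fh);
    }
    close($procs_fh);
    print "$total\n";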

first, a VM with barely any sharing:

merged pages: 1574

Benchmark 1: extract_pss
  Time (mean ± σ):      15.0 ms ±   0.6 ms    [User: 4.2 ms, System: 10.8 ms]
  Range (min … max):    14.1 ms …  17.0 ms    173 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 4.3 ms, System: 1.5 ms]
  Range (min … max):     5.3 ms …   7.7 ms    466 runs

Summary
  extract_rss ran
    2.56 ± 0.16 times faster than extract_pss

Benchmark 1: qm_status_stock
  Time (mean ± σ):     363.5 ms ±   5.6 ms    [User: 290.8 ms, System: 68.5 ms]
  Range (min … max):   353.1 ms … 370.4 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     360.6 ms ±   4.2 ms    [User: 285.4 ms, System: 71.0 ms]
  Range (min … max):   355.0 ms … 366.5 ms    10 runs

Summary
  qm_status_patched ran
    1.01 ± 0.02 times faster than qm_status_stock

this shows very little difference in total status runtime.

next, a VM with modest sharing:

merged pages: 52118

Benchmark 1: extract_pss
  Time (mean ± σ):      57.1 ms ±   1.3 ms    [User: 4.3 ms, System: 52.8 ms]
  Range (min … max):    54.6 ms …  60.4 ms    50 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.3 ms, System: 1.6 ms]
  Range (min … max):     5.4 ms …   6.9 ms    464 runs

Summary
  extract_rss ran
    9.60 ± 0.52 times faster than extract_pss

Benchmark 1: qm_status_stock
  Time (mean ± σ):     407.9 ms ±   5.9 ms    [User: 288.2 ms, System: 115.0 ms]
  Range (min … max):   402.2 ms … 419.3 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     412.9 ms ±   7.6 ms    [User: 294.4 ms, System: 113.9 ms]
  Range (min … max):   405.9 ms … 425.8 ms    10 runs

Summary
  qm_status_stock ran
    1.01 ± 0.02 times faster than qm_status_patched

while the stat extraction alone would be a lot faster via RSS, it is still
dwarfed by the total status runtime (and the patched `qm status` still uses
PSS in this case, as the threshold is not reached!).

and now a VM with the problematic behaviour caused by lots of sharing (~12GB):

merged pages: 3095741

Benchmark 1: extract_pss
  Time (mean ± σ):     583.2 ms ±   4.6 ms    [User: 3.9 ms, System: 579.1 ms]
  Range (min … max):   573.9 ms … 591.7 ms    10 runs

Benchmark 2: extract_rss
  Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.2 ms, System: 1.7 ms]
  Range (min … max):     5.4 ms …   7.3 ms    412 runs

Summary
  extract_rss ran
   97.66 ± 5.00 times faster than extract_pss

extraction via PSS alone is now slower than the whole status call with RSS:

Benchmark 1: qm_status_stock
  Time (mean ± σ):     935.5 ms ±   8.4 ms    [User: 292.2 ms, System: 638.6 ms]
  Range (min … max):   924.8 ms … 952.0 ms    10 runs

Benchmark 2: qm_status_patched
  Time (mean ± σ):     359.9 ms ±   7.6 ms    [User: 295.1 ms, System: 60.3 ms]
  Range (min … max):   350.1 ms … 371.3 ms    10 runs

Summary
  qm_status_patched ran
    2.60 ± 0.06 times faster than qm_status_stock

Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---

Notes:
    the threshold is a bit arbitrary, we could also consider setting it
    lower to be on the safe side, or make it relative to the total
    number of pages of memory..
    
    one issue with this approach is that if KSM is disabled later on and
    all the merging is undone, the problematic behaviour remains, and
    there is - AFAICT - no trace of this state in `ksm_stat` of the
    process or elsewhere. the behaviour goes away if the VM is stopped
    and started again. instead of doing a per-pid decision, we might
    want to opt for setting a global RSS fallback in case KSM is
    detected as active on the host?
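
    a host-wide check for that global fallback could look roughly like this
    (sketch only - it has the same blind spot if KSM was active earlier but
    is disabled and fully unmerged by now):

        # if KSM is enabled or still has merged pages on the host, skip PSS
        # for all VMs and use the RSS path instead of deciding per PID
        my $ksm_run = PVE::Tools::file_read_firstline('/sys/kernel/mm/ksm/run') // 0;
        my $ksm_sharing = PVE::Tools::file_read_firstline('/sys/kernel/mm/ksm/pages_sharing') // 0;
        my $use_rss_fallback = $ksm_run || $ksm_sharing > 0;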

    we should of course also investigate further whether this is fixable
    or improvable on the kernel side..

 src/PVE/QemuServer.pm | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index a7fbec14..82e9c004 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
     if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
         while (my $pid = <$procs_fh>) {
             chomp($pid);
+            my $filename = 'smaps_rollup';
+            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
 
-            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
+            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
+            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
+            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
+                $filename = 'status';
+                $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
+            }
+
+            open(my $pid_fh, '<', "/proc/${pid}/${filename}")
                 or $!{ENOENT}
-                or die "failed to open PSS memory-stat from process - $!\n";
-            next if !defined($smaps_fh);
+                or die "failed to open /proc/${pid}/${filename} - $!\n";
+            next if !defined($pid_fh);
 
-            while (my $line = <$smaps_fh>) {
-                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
+            while (my $line = <$pid_fh>) {
+                if ($line =~ $extract_usage_re) {
                     $memory_usage += int($1) * 1024;
                     last; # end inner while loop, go to next $pid
                 }
             }
-            close $smaps_fh;
+            close $pid_fh;
         }
         close($procs_fh);
     }
-- 
2.47.3




* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Thomas Lamprecht @ 2025-11-25 14:08 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

On 25.11.25 at 14:51, Fabian Grünbichler wrote:
> after a certain amount of KSM sharing, PSS lookups become prohibitively
> expensive. fallback to RSS (which was used before) in that case, to avoid
> vmstatus calls blocking for long periods of time.
> 
> I benchmarked this with 3 VMs running with different levels of KSM sharing. in
> the output below, "merged pages" refers to the contents of
> /proc/$pid/ksm_merging_pages, extract_pss is the parsing code for cumulative
> PSS of a VM cgroup isolated, extract_rss is the parsing code for cumulative RSS
> of a VM cgroup isolated, qm_status_stock is `qm status $vmid --verbose`, and
> qm_status_patched is `perl -I./src/PVE ./src/bin/qm status $vmid --verbose`
> with this patch applied.
> 
> first, a VM with barely any sharing:
> 
> merged pages: 1574
> 
> Benchmark 1: extract_pss
>   Time (mean ± σ):      15.0 ms ±   0.6 ms    [User: 4.2 ms, System: 10.8 ms]
>   Range (min … max):    14.1 ms …  17.0 ms    173 runs
> 
> Benchmark 2: extract_rss
>   Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 4.3 ms, System: 1.5 ms]
>   Range (min … max):     5.3 ms …   7.7 ms    466 runs
> 
> Summary
>   extract_rss ran
>     2.56 ± 0.16 times faster than extract_pss
> 
> Benchmark 1: qm_status_stock
>   Time (mean ± σ):     363.5 ms ±   5.6 ms    [User: 290.8 ms, System: 68.5 ms]
>   Range (min … max):   353.1 ms … 370.4 ms    10 runs
> 
> Benchmark 2: qm_status_patched
>   Time (mean ± σ):     360.6 ms ±   4.2 ms    [User: 285.4 ms, System: 71.0 ms]
>   Range (min … max):   355.0 ms … 366.5 ms    10 runs
> 
> Summary
>   qm_status_patched ran
>     1.01 ± 0.02 times faster than qm_status_stock
> 
> shows very little difference in total status runtime.
> 
> next, a VM with modest sharing:
> 
> merged pages: 52118
> 
> Benchmark 1: extract_pss
>   Time (mean ± σ):      57.1 ms ±   1.3 ms    [User: 4.3 ms, System: 52.8 ms]
>   Range (min … max):    54.6 ms …  60.4 ms    50 runs
> 
> Benchmark 2: extract_rss
>   Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.3 ms, System: 1.6 ms]
>   Range (min … max):     5.4 ms …   6.9 ms    464 runs
> 
> Summary
>   extract_rss ran
>     9.60 ± 0.52 times faster than extract_pss
> 
> Benchmark 1: qm_status_stock
>   Time (mean ± σ):     407.9 ms ±   5.9 ms    [User: 288.2 ms, System: 115.0 ms]
>   Range (min … max):   402.2 ms … 419.3 ms    10 runs
> 
> Benchmark 2: qm_status_patched
>   Time (mean ± σ):     412.9 ms ±   7.6 ms    [User: 294.4 ms, System: 113.9 ms]
>   Range (min … max):   405.9 ms … 425.8 ms    10 runs
> 
> Summary
>   qm_status_stock ran
>     1.01 ± 0.02 times faster than qm_status_patched
> 
> while the stat extraction alone would be a lot faster via RSS, the total status
> runtime is still a lot bigger (the patched `qm status` will still use PSS in
> this case!).
> 
> and now a VM with the problematic behaviour caused by lots of sharing (~12GB):
> 
> merged pages: 3095741
> 
> Benchmark 1: extract_pss
>   Time (mean ± σ):     583.2 ms ±   4.6 ms    [User: 3.9 ms, System: 579.1 ms]
>   Range (min … max):   573.9 ms … 591.7 ms    10 runs
> 
> Benchmark 2: extract_rss
>   Time (mean ± σ):       6.0 ms ±   0.3 ms    [User: 4.2 ms, System: 1.7 ms]
>   Range (min … max):     5.4 ms …   7.3 ms    412 runs
> 
> Summary
>   extract_rss ran
>    97.66 ± 5.00 times faster than extract_pss
> 
> extraction via PSS alone is now slower than the whole status call with RSS:
> 
> Benchmark 1: qm_status_stock
>   Time (mean ± σ):     935.5 ms ±   8.4 ms    [User: 292.2 ms, System: 638.6 ms]
>   Range (min … max):   924.8 ms … 952.0 ms    10 runs
> 
> Benchmark 2: qm_status_patched
>   Time (mean ± σ):     359.9 ms ±   7.6 ms    [User: 295.1 ms, System: 60.3 ms]
>   Range (min … max):   350.1 ms … 371.3 ms    10 runs
> 
> Summary
>   qm_status_patched ran
>     2.60 ± 0.06 times faster than qm_status_stock
> 
> Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> 
> Notes:
>     the threshold is a bit arbitrary, we could also consider setting it
>     lower to be on the safe side, or make it relative to the total
>     number of pages of memory..
>     
>     one issue with this approach is that if KSM is disabled later on and
>     all the merging is undone, the problematic behaviour remains, and
>     there is - AFAICT - no trace of this state in `ksm_stat` of the
>     process or elsewhere. the behaviour goes away if the VM is stopped
>     and started again. instead of doing a per-pid decision, we might
>     want to opt for setting a global RSS fallback in case KSM is
>     detected as active on the host?

One can now also disable KSM per VM, so that config property should be
checked too if we go that route.

> 
>     we should of course also investigate further whether this is fixable
>     or improvable on the kernel side..
> 
>  src/PVE/QemuServer.pm | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
> index a7fbec14..82e9c004 100644
> --- a/src/PVE/QemuServer.pm
> +++ b/src/PVE/QemuServer.pm
> @@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
>      if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {


Just to be sure: the stats from memory.current or memory.stat inside the
/sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory are definitely not
enough for our use cases?

>          while (my $pid = <$procs_fh>) {
>              chomp($pid);
> +            my $filename = 'smaps_rollup';
> +            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
>  
> -            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
> +            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
> +            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
> +            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {

Hmm, this can lead to sudden "jumps" in the rrd metrics data, but that's
rather independent of the exact decision expression and always the case if we
switch between the two sources. Dropping this stat again completely could also
be an option.. A middle ground could be to just display it in the live view,
with such a heuristic as proposed here.

> +                $filename = 'status';
> +                $extract_usage_re = qr/^VmRSS:\s+([0-9]+) kB$/;
> +            }
> +
> +            open(my $pid_fh, '<', "/proc/${pid}/${filename}")
>                  or $!{ENOENT}
> -                or die "failed to open PSS memory-stat from process - $!\n";
> -            next if !defined($smaps_fh);
> +                or die "failed to open /proc/${pid}/${filename} - $!\n";
> +            next if !defined($pid_fh);
>  
> -            while (my $line = <$smaps_fh>) {
> -                if ($line =~ m/^Pss:\s+([0-9]+) kB$/) {
> +            while (my $line = <$pid_fh>) {
> +                if ($line =~ $extract_usage_re) {
>                      $memory_usage += int($1) * 1024;
>                      last; # end inner while loop, go to next $pid
>                  }
>              }
> -            close $smaps_fh;
> +            close $pid_fh;
>          }
>          close($procs_fh);
>      }




* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Fabian Grünbichler @ 2025-11-25 14:20 UTC (permalink / raw)
  To: Proxmox VE development discussion, Thomas Lamprecht

On November 25, 2025 3:08 pm, Thomas Lamprecht wrote:
> On 25.11.25 at 14:51, Fabian Grünbichler wrote:
>> after a certain amount of KSM sharing, PSS lookups become prohibitively
>> expensive. fallback to RSS (which was used before) in that case, to avoid
>> vmstatus calls blocking for long periods of time.
>> 
>> I benchmarked this with 3 VMs running with different levels of KSM sharing. in
>> the output below, "merged pages" refers to the contents of
>> /proc/$pid/ksm_merging_pages, extract_pss is the parsing code for cumulative
>> PSS of a VM cgroup isolated, extract_rss is the parsing code for cumulative RSS
>> of a VM cgroup isolated, qm_status_stock is `qm status $vmid --verbose`, and
>> qm_status_patched is `perl -I./src/PVE ./src/bin/qm status $vmid --verbose`
>> with this patch applied.
>>
>> [..]
>> 
>> Fixes: d426de6c7d81a4d04950f2eaa9afe96845d73f7e ("vmstatus: add memhost for host view of vm mem consumption")
>> 
>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>> ---
>> 
>> Notes:
>>     the threshold is a bit arbitrary, we could also consider setting it
>>     lower to be on the safe side, or make it relative to the total
>>     number of pages of memory..
>>     
>>     one issue with this approach is that if KSM is disabled later on and
>>     all the merging is undone, the problematic behaviour remains, and
>>     there is - AFAICT - no trace of this state in `ksm_stat` of the
>>     process or elsewhere. the behaviour goes away if the VM is stopped
>>     and started again. instead of doing a per-pid decision, we might
>>     want to opt for setting a global RSS fallback in case KSM is
>>     detected as active on the host?
> 
> One can now also disable KSM per VM, so that config property should be
> checked too if we go that route.

right, if the running config has that set, we should be allowed to query
PSS.. that actually might be the nicest solution?
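
something like the following, perhaps (just a sketch - 'ksm' as the property
name is a placeholder for whatever the per-VM option is actually called, and
the running config would need to be plumbed into the helper):

    # decide per PID whether reading PSS via smaps_rollup is safe, taking a
    # (hypothetical) per-VM KSM config property into account
    sub use_pss_for_pid {
        my ($pid, $conf) = @_;

        # KSM disabled for this VM -> nothing was ever merged, PSS stays cheap
        return 1 if defined($conf->{ksm}) && !$conf->{ksm};

        my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
        # more than 1 GiB merged (4 KiB pages) -> smaps_rollup gets too slow
        return 0 if $ksm_pages && $ksm_pages > 1024 * 1024 / 4;

        return 1;
    }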

>>     we should of course also investigate further whether this is fixable
>>     or improvable on the kernel side..
>> 
>>  src/PVE/QemuServer.pm | 21 +++++++++++++++------
>>  1 file changed, 15 insertions(+), 6 deletions(-)
>> 
>> diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
>> index a7fbec14..82e9c004 100644
>> --- a/src/PVE/QemuServer.pm
>> +++ b/src/PVE/QemuServer.pm
>> @@ -2333,19 +2333,28 @@ my sub get_vmid_total_cgroup_memory_usage {
>>      if (my $procs_fh = IO::File->new("/sys/fs/cgroup/qemu.slice/${vmid}.scope/cgroup.procs", "r")) {
> 
> 
> Just to be sure: The stats from memory.current or memory.stat inside the
> /sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory is definitively not
> enough for our usecases?

well, if we go for RSS they might be, for PSS they are not, since that
doesn't exist there?
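
for reference, on the cgroup level a value can be read in one go, e.g. (sketch
only - memory.current also includes page cache and kernel memory charged to
the scope, so it is not equivalent to summed per-process RSS, let alone PSS):

    # read the cgroup-level counter instead of iterating over the PIDs,
    # with $vmid as in get_vmid_total_cgroup_memory_usage
    my $mem_current = PVE::Tools::file_read_firstline(
        "/sys/fs/cgroup/qemu.slice/${vmid}.scope/memory.current");
    $mem_current = defined($mem_current) ? int($mem_current) : undef; # bytes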

> 
>>          while (my $pid = <$procs_fh>) {
>>              chomp($pid);
>> +            my $filename = 'smaps_rollup';
>> +            my $extract_usage_re = qr/^Pss:\s+([0-9]+) kB$/;
>>  
>> -            open(my $smaps_fh, '<', "/proc/${pid}/smaps_rollup")
>> +            my $ksm_pages = PVE::Tools::file_read_firstline("/proc/$pid/ksm_merging_pages");
>> +            # more than 1G shared via KSM, smaps_rollup will be slow, fall back to RSS
>> +            if ($ksm_pages && $ksm_pages > 1024 * 1024 / 4) {
> 
> Hmm, can lead to sudden "jumps" in the rrd metrics data, but that's rather
> independent from the decision expression but always the case if we switch
> between that. Dropping this stat again completely could be also an option..
> A middle ground could be to just display it for the live view with such a
> heuristic as proposed here.

having the live view and the metrics use different semantics seems kinda
confusing tbh..



* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Aaron Lauterer @ 2025-11-25 14:53 UTC (permalink / raw)
  To: Fabian Grünbichler, pve-devel

looks like a sensible stopgap measure IMO. It will make the whole "how
much memory is being used" question even more complex *sigh*

consider this
Reviewed-By: Aaron Lauterer <a.lauterer@proxmox.com>




* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Thomas Lamprecht @ 2025-11-25 15:21 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox VE development discussion

On 25.11.25 at 15:20, Fabian Grünbichler wrote:
> On November 25, 2025 3:08 pm, Thomas Lamprecht wrote:
>> Just to be sure: The stats from memory.current or memory.stat inside the
>> /sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory is definitively not
>> enough for our usecases?
> 
> well, if we go for RSS they might be, for PSS they are not, since that
> doesn't exist there?

Would need to take a closer look to tell for sure, but from a quick check
it indeed seems to not be there.

> having the live view and the metrics use different semantics seems kinda
> confusing tbh..

more than silently jumping between metrics over time? ;-) The live view can
easily be annotated with a different label or the like if the source is a
different one; that's not so easy for metrics.

The more I think about this, the more I'm in favor of just deprecating this
again completely; the page table walking can even cause some latency spikes
in the target process, IMO just not worth it. If the kernel can give us this
for free, or at least much cheaper, in the future, then great, but until then
it's not really an option. If we keep it at all, we can make it opt-in. The
best granularity here would probably be the guest config, but for starters a
cluster-wide datacenter option could already be enough for the setups that
are fine with this performance trade-off in general.
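
as a rough illustration of the opt-in idea (sketch only - the datacenter.cfg
option name 'memhost-pss' is purely hypothetical):

    # only collect the expensive PSS-based value when explicitly enabled
    # cluster-wide; $d is the per-VM status hash filled in vmstatus
    my $dc_conf = PVE::Cluster::cfs_read_file('datacenter.cfg');
    if ($dc_conf->{'memhost-pss'}) {
        $d->{memhost} = get_vmid_total_cgroup_memory_usage($vmid);
    }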



* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Aaron Lauterer @ 2025-11-25 17:21 UTC (permalink / raw)
  To: Proxmox VE development discussion, Thomas Lamprecht,
	Fabian Grünbichler



On 2025-11-25 16:20, Thomas Lamprecht wrote:
> On 25.11.25 at 15:20, Fabian Grünbichler wrote:
>> On November 25, 2025 3:08 pm, Thomas Lamprecht wrote:
>>> Just to be sure: The stats from memory.current or memory.stat inside the
>>> /sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory is definitively not
>>> enough for our usecases?
>>
>> well, if we go for RSS they might be, for PSS they are not, since that
>> doesn't exist there?
> 
> Would need to take a closer look to tell for sure, but from a quick check
> it indeed seems to not be there.
> 
>> having the live view and the metrics use different semantics seems kinda
>> confusing tbh..
> 
> more than jumping between metrics over time silently? ;-) The live view can
> be easily annotated with a different label or the like if the source is
> another, not so easy for metrics.
> 
> The more I think about this the more I'm in favor of just deprecating this
> again completely, this page table walking can even cause some latency spikes
> in the target process, IMO just not worth it. If the kernel can give us this
> free, or at least much cheaper, in the future, then great, but until then it's
> not really an option. If, we can make this opt-in. The best granularity here
> probably would be through guest config, but for starters a cluster-wide
> datacenter option could be already enough for the setups that are fine with
> this performance trade-off in general.


If I may add my 2 cents here: how much do we lose by switching
completely to fetching RSS (or the cgroupv2 equivalent), both for the
metrics and for the live view?
AFAIU the resulting memory accounting will be a bit higher, as shared
libraries will be fully accounted to each cgroup instead of
proportionally as with PSS.

I am not sure if we want to introduce additional config options (global
per DC, or per guest) to change the behavior, as that is probably even
more confusing for not that much gain.

And it's not like PSS doesn't come with its own set of weirdness. E.g.,
if we run 10 VMs and stop all but one, the last will see an increase in
memory consumption as it is then the sole user of shared libraries.


* Re: [pve-devel] [RFC qemu-server] fix #6935: vmstatus: fallback to RSS in case of KSM usage
From: Thomas Lamprecht @ 2025-11-25 18:17 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer,
	Fabian Grünbichler

On 25.11.25 at 18:21, Aaron Lauterer wrote:
> On 2025-11-25 16:20, Thomas Lamprecht wrote:
>> On 25.11.25 at 15:20, Fabian Grünbichler wrote:
>>> On November 25, 2025 3:08 pm, Thomas Lamprecht wrote:
>>>> Just to be sure: The stats from memory.current or memory.stat inside the
>>>> /sys/fs/cgroup/qemu.slice/${vmid}.scope/ directory is definitively not
>>>> enough for our usecases?
>>>
>>> well, if we go for RSS they might be, for PSS they are not, since that
>>> doesn't exist there?
>>
>> Would need to take a closer look to tell for sure, but from a quick check
>> it indeed seems to not be there.
>>
>>> having the live view and the metrics use different semantics seems kinda
>>> confusing tbh..
>>
>> more than jumping between metrics over time silently? ;-) The live view can
>> be easily annotated with a different label or the like if the source is
>> another, not so easy for metrics.
>>
>> The more I think about this the more I'm in favor of just deprecating this
>> again completely, this page table walking can even cause some latency spikes
>> in the target process, IMO just not worth it. If the kernel can give us this
>> free, or at least much cheaper, in the future, then great, but until then it's
>> not really an option. If, we can make this opt-in. The best granularity here
>> probably would be through guest config, but for starters a cluster-wide
>> datacenter option could be already enough for the setups that are fine with
>> this performance trade-off in general.
> 
> 
> If I may add my 2 cents here. How much do we lose by switching completely to fetching RSS (or the cgroupv2 equivalent)? For the metrics and live view.
> AFAIU the resulting memory accounting will be a bit higher, as shared libraries will be fully accounted for for each cgroup and not proportionally as with PSS.

You account shared memory more than once if a user checks each VM and sums
the values up, i.e. the total can even come out to more than the installed
memory. IME this confuses people more than over-time effects do, but no hard
feelings here as long as it's clear that it now ignores whether any memory is
shared between multiple processes/VMs.

> I am not sure if we want to introduce additional config options (global per DC, or per guest) to change the behavior. As that is probably even more confusing for not that much gain.

I have no problem with ripping this out, that was just a proposal for a cheap
way to allow keeping this behavior for those that really want it.

> And its not like PSS doesn't come with its own set of weirdness. E.g., if we run 10 VMs, and stop all but one, the last will see an increase in memory consumption as it is the sole user of shared libraries.

Yes, that's the basic underlying principle and simply how accounting works in
reality. At any point in time the PSS value is correct though, while with RSS
it's wrong at any time.

With KSM you already have similar effects: from the POV of a VM the memory
usage can stay the same, but due to changes in what that memory contains the
KSM sharing rate goes down and thus memory usage on the host goes up, even if
all VMs kept exactly the same amount of memory in use. Once you start sharing
things, the usage stats stop being trivial.


