* [pve-devel] [PATCH] ZFS ARC size not taken into account by pvestatd or ksmtuned
@ 2021-01-19 14:42 Stephane Chazelas
2021-01-23 9:14 ` Bruce Wainer
0 siblings, 1 reply; 3+ messages in thread
From: Stephane Chazelas @ 2021-01-19 14:42 UTC (permalink / raw)
To: pve-devel
[note that I'm not subscribed to the list]
Hello,
I've been meaning to send this for years. Sorry for the delay.
We've been maintaining the patch below on our servers for years
now (since 2015), even before ZFS was officially supported by
PVE.
We had experienced VM balloons swelling and processes in VMs
running out of memory even though the host had tons of RAM.
We had tracked that down to pvestatd reclaiming memory from the
VMs. pvestatd targets 80% memory utilisation (in terms of memory
that is not free and not in buffers or caches: (memtotal -
memfree - buffers - cached) / memtotal).
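To make that metric concrete, here is a minimal shell sketch of the
utilisation figure pvestatd computes, using made-up /proc/meminfo
values (the field names match the real file; the numbers are
hypothetical):

```shell
# Hypothetical /proc/meminfo sample (values in kB)
meminfo='MemTotal:       16384000 kB
MemFree:         2048000 kB
Buffers:          512000 kB
Cached:          1536000 kB'

# used% = (memtotal - memfree - buffers - cached) / memtotal
used_pct=$(printf '%s\n' "$meminfo" | awk '
    $1 == "MemTotal:" {total = $2}
    /^(MemFree|Buffers|Cached):/ {free += $2}
    END {printf "%.0f%%", (total - free) / total * 100}')

echo "$used_pct"   # 75% for this sample; above 80% pvestatd reclaims from VMs
```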
The problem is that the ZFS ARC is tracked independently (not as
part of "buffers" or "cached" above).
The size of that ARC also adapts to memory pressure. But here,
since autoballooning frees memory as soon as the ARC uses it up,
the ARC grows and grows while the VMs access their disks, and we
end up with plenty of memory wasted sitting free, never put to
use.
So in the end, with an ARC allowed to grow up to half the RAM,
we end up in a situation where pvestatd in effect targets 30%
max memory utilisation (with 20% free or in buffers and 50% in
ARC).
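The ARC size the patch below reads comes from the "size" row of the
arcstats kstat. A sketch with a made-up sample (the real file is
column-formatted name/type/data, with "size" in bytes):

```shell
# Made-up /proc/spl/kstat/zfs/arcstats sample (columns: name  type  data)
arcstats='name                            type data
hits                            4    123456
size                            4    8589934592'

# Same "size" match the ksmtuned hunk below uses; bytes -> kB to match meminfo
arc_kb=$(printf '%s\n' "$arcstats" | awk '$1 == "size" {print int($3/1024)}')

echo "$arc_kb"   # 8388608 kB, i.e. an 8 GiB ARC
```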
Something similar happens for KSM (memory page deduplication).
/usr/sbin/ksmtuned monitors memory utilisation (again
total-cached-buffers-free) against kvm process memory
allocation, and tells the ksm daemon to scan more and more
pages, more and more aggressively as long as the "used" memory
is above 80%.
That probably explains why performance decreases significantly
after a while, and why an "echo 3 >
/proc/sys/vm/drop_caches" (which clears buffers, caches *AND*
the ZFS ARC) gives a second life to the system.
(by the way, a recent version of ProcFSTools.pm added a
read_pressure function, but it doesn't look like it's used
anywhere).
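For reference, on kernels with PSI (4.20+) /proc/pressure/memory
looks like the sample below; here is a sketch of pulling out the
10-second "some" stall average (the sample values are made up, and
I can't speak for how read_pressure itself parses the file):

```shell
# Made-up /proc/pressure/memory sample (PSI format, kernel >= 4.20)
psi='some avg10=1.23 avg60=0.80 avg300=0.15 total=123456789
full avg10=0.00 avg60=0.10 avg300=0.05 total=9876543'

# 10-second average of time at least one task was stalled on memory
avg10=$(printf '%s\n' "$psi" | awk '$1 == "some" {sub(/avg10=/, "", $2); print $2}')

echo "$avg10"   # 1.23 (percent)
```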
--- /usr/share/perl5/PVE/ProcFSTools.pm.distrib 2020-12-03 15:53:17.000000000 +0000
+++ /usr/share/perl5/PVE/ProcFSTools.pm 2021-01-19 13:44:42.480272044 +0000
@@ -268,6 +268,19 @@ sub read_meminfo {
$res->{memtotal} = $d->{memtotal};
$res->{memfree} = $d->{memfree} + $d->{buffers} + $d->{cached};
+
+ # Add the ZFS ARC if any
+ if (my $fh_arc = IO::File->new("/proc/spl/kstat/zfs/arcstats", "r")) {
+ while (my $line = <$fh_arc>) {
+ if ($line =~ m/^size .* (\d+)/) {
+ # "size" already in bytes
+ $res->{memfree} += $1;
+ last;
+ }
+ }
+ close($fh_arc);
+ }
+
$res->{memused} = $res->{memtotal} - $res->{memfree};
$res->{swaptotal} = $d->{swaptotal};
--- /usr/sbin/ksmtuned.distrib 2020-07-24 10:04:45.827828719 +0100
+++ /usr/sbin/ksmtuned 2021-01-19 14:37:43.416360037 +0000
@@ -75,10 +75,17 @@ committed_memory () {
ps -C "$progname" -o vsz= | awk '{ sum += $1 }; END { print sum }'
}
-free_memory () {
- awk '/^(MemFree|Buffers|Cached):/ {free += $2}; END {print free}' \
- /proc/meminfo
-}
+free_memory () (
+ shopt -s nullglob
+ exec awk '
+ NR == FNR {
+ if (/^(MemFree|Buffers|Cached):/) free += $2
+ next
+ }
+ $1 == "size" {free += int($3/1024)}
+ END {print free}
+ ' /proc/meminfo /proc/spl/kstat/zfs/[a]rcstats
+)
increase_npages() {
local delta
* Re: [pve-devel] [PATCH] ZFS ARC size not taken into account by pvestatd or ksmtuned
2021-01-19 14:42 [pve-devel] [PATCH] ZFS ARC size not taken into account by pvestatd or ksmtuned Stephane Chazelas
@ 2021-01-23 9:14 ` Bruce Wainer
2021-01-24 9:22 ` Stephane CHAZELAS
0 siblings, 1 reply; 3+ messages in thread
From: Bruce Wainer @ 2021-01-23 9:14 UTC (permalink / raw)
To: Proxmox VE development discussion, stephane
Hello Stephane,
I believe this change is very important and would fix an issue I'm having
on some of my servers. However the Proxmox team seems to be particular
about how they accept patches (which I don't blame them for, it just is
what it is). The details are here:
https://pve.proxmox.com/wiki/Developer_Documentation - but in general you
need to submit a Contributor Agreement, and submit the patch in a way that
makes it easy for them to apply it.
Alternatively if you are not willing to go through the steps of submitting
this yourself, we might be able to work things out so I can submit it.
Please let me know.
Sincerely,
Bruce Wainer
On Wed, Jan 20, 2021 at 8:44 AM Stephane Chazelas <stephane@chazelas.org>
wrote:
> [...]
* Re: [pve-devel] [PATCH] ZFS ARC size not taken into account by pvestatd or ksmtuned
2021-01-23 9:14 ` Bruce Wainer
@ 2021-01-24 9:22 ` Stephane CHAZELAS
0 siblings, 0 replies; 3+ messages in thread
From: Stephane CHAZELAS @ 2021-01-24 9:22 UTC (permalink / raw)
To: Bruce Wainer; +Cc: Proxmox VE development discussion
2021-01-23 04:14:52 -0500, Bruce Wainer:
[...]
> Alternatively if you are not willing to go through the steps of submitting
> this yourself, we might be able to work things out so I can submit it.
[...]
Hi Bruce,
yes please. That's a very simple change. The idea is just to
count the ARC size as buffer/cache like the rest of the buffers
and caches; it doesn't really matter much how exactly it's done.
I use git rarely enough that it always takes me hours to get back
into it.
I see that ksmtuned is actually maintained upstream at Red Hat,
so I suppose Proxmox will likely want to carry the change as an
extra Debian patch.
I should point out that I've not revisited this issue in years,
and at the time the aim was to quickly address it for our
specific use case. There may be better ways to address it these
days. In particular, I'm not sure why the kernel doesn't account
for the ARC under buffers/cache in the first place; maybe that's
customisable now. I've not had a look at that /proc/pressure
thing referenced in the latest ProcFSTools commit either. But in
any case, the patch I submitted has proved to be an improvement
for us.
Cheers,
Stephane