Date: Tue, 19 Jan 2021 14:42:35 +0000
From: Stephane Chazelas
To: pve-devel@lists.proxmox.com
Message-ID: <20210119144235.54f7jofljgjqpbts@chazelas.org>
Subject: [pve-devel] [PATCH] ZFS ARC size not taken into account by pvestatd or ksmtuned

[note that I'm not subscribed to the list]

Hello,

I've been meaning to send this for years now; sorry for the delay.

We've been maintaining the patch below on our servers since 2015, even before
ZFS was officially supported by PVE.

We had experienced VM balloons swelling and processes in VMs running out of
memory even though the host had tons of RAM. We tracked that down to pvestatd
reclaiming memory from the VMs.

pvestatd targets 80% memory utilisation, where "utilisation" is the memory
that is neither free nor in buffers or caches:
(memtotal - memfree - buffers - cached) / memtotal.

The problem is that the ZFS ARC is tracked independently (not as part of
"buffers" or "cached" above). The size of the ARC also adapts to memory
pressure. But here, since autoballooning frees memory as soon as the ARC uses
it up, the ARC grows and grows while the VMs access their disks, and we end up
with plenty of free memory that is wasted because it is never used.
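To make that arithmetic concrete, here is a minimal, standalone Perl sketch
(illustration only, not PVE code; meminfo_kib() and arc_size_bytes() are
helpers made up for this example). It computes the "used" ratio the way it is
described above, and then the same ratio with the ARC size counted as
reclaimable. It only assumes the /proc/meminfo and
/proc/spl/kstat/zfs/arcstats formats already discussed.

#!/usr/bin/perl
# Illustration only: compare the "used" ratio with and without the ZFS ARC
# counted as reclaimable memory.  meminfo_kib() and arc_size_bytes() are
# hypothetical helpers written for this sketch, not PVE functions.
use strict;
use warnings;

sub meminfo_kib {
    my %m;
    open(my $fh, '<', '/proc/meminfo') or die "meminfo: $!";
    while (<$fh>) {
        # lines look like "MemFree:  123456 kB"; values are in KiB
        $m{$1} = $2 if /^(\w+):\s+(\d+)/;
    }
    return \%m;
}

sub arc_size_bytes {
    # no arcstats file means no ZFS, so no ARC to account for
    open(my $fh, '<', '/proc/spl/kstat/zfs/arcstats') or return 0;
    while (<$fh>) {
        # the "size" row is "size <type> <bytes>"
        return $1 if /^size\s+\d+\s+(\d+)/;
    }
    return 0;
}

my $m       = meminfo_kib();
my $arc_kib = arc_size_bytes() / 1024;

my $free          = $m->{MemFree} + $m->{Buffers} + $m->{Cached};
my $used          = ($m->{MemTotal} - $free) / $m->{MemTotal};
my $used_less_arc = ($m->{MemTotal} - $free - $arc_kib) / $m->{MemTotal};

printf "used as currently computed:    %.1f%%\n", 100 * $used;
printf "used with ARC counted as free: %.1f%%\n", 100 * $used_less_arc;

On a host where the ARC has grown large, the two numbers differ by roughly the
ARC's share of RAM, which is exactly the gap pvestatd currently misreads.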
So in the end, with an ARC allowed to grow up to half the RAM, we end up in a
situation where pvestatd in effect targets at most 30% memory utilisation
(with 20% free or in buffers and 50% in the ARC).

Something similar happens for KSM (memory page deduplication).
/usr/sbin/ksmtuned monitors memory utilisation (again
total - cached - buffers - free) against the memory allocated to the kvm
processes, and tells the ksm daemon to scan more and more pages, more and more
aggressively, as long as the "used" memory is above 80%.

That probably explains why performance degrades significantly after a while,
and why doing "echo 3 > /proc/sys/vm/drop_caches" (which clears buffers,
caches *and* the ZFS ARC) gives the system a second life.

(By the way, a recent version of ProcFSTools.pm added a read_pressure
function, but it doesn't look like it's used anywhere.)

--- /usr/share/perl5/PVE/ProcFSTools.pm.distrib	2020-12-03 15:53:17.000000000 +0000
+++ /usr/share/perl5/PVE/ProcFSTools.pm	2021-01-19 13:44:42.480272044 +0000
@@ -268,6 +268,19 @@ sub read_meminfo {
     $res->{memtotal} = $d->{memtotal};
     $res->{memfree} = $d->{memfree} + $d->{buffers} + $d->{cached};
+
+    # Add the ZFS ARC if any
+    if (my $fh_arc = IO::File->new("/proc/spl/kstat/zfs/arcstats", "r")) {
+        while (my $line = <$fh_arc>) {
+            if ($line =~ m/^size .* (\d+)/) {
+                # "size" already in bytes
+                $res->{memfree} += $1;
+                last;
+            }
+        }
+        close($fh_arc);
+    }
+
     $res->{memused} = $res->{memtotal} - $res->{memfree};
 
     $res->{swaptotal} = $d->{swaptotal};

--- /usr/sbin/ksmtuned.distrib	2020-07-24 10:04:45.827828719 +0100
+++ /usr/sbin/ksmtuned	2021-01-19 14:37:43.416360037 +0000
@@ -75,10 +75,17 @@ committed_memory () {
     ps -C "$progname" -o vsz= | awk '{ sum += $1 }; END { print sum }'
 }
 
-free_memory () {
-    awk '/^(MemFree|Buffers|Cached):/ {free += $2}; END {print free}' \
-        /proc/meminfo
-}
+free_memory () (
+    shopt -s nullglob
+    exec awk '
+        NR == FNR {
+            if (/^(MemFree|Buffers|Cached):/) free += $2
+            next
+        }
+        $1 == "size" {free += int($3/1024)}
+        END {print free}
+    ' /proc/meminfo /proc/spl/kstat/zfs/[a]rcstats
+)
 
 increase_npages() {
     local delta
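A couple of notes on the ksmtuned hunk: the function body becomes a subshell
so that "shopt -s nullglob" and "exec" stay local, and the
/proc/spl/kstat/zfs/[a]rcstats glob expands to nothing on hosts without ZFS,
so awk is never pointed at a missing file. /proc/meminfo values are already in
KiB while the arcstats "size" field is in bytes, hence the int($3/1024). For
readers more comfortable with Perl, here is a rough equivalent of the new
free_memory() logic (a sketch only, not part of the patch):

#!/usr/bin/perl
# Sketch only (not part of the patch): same arithmetic as the new
# free_memory() above, everything in KiB.
use strict;
use warnings;

my $free_kib = 0;

# /proc/meminfo already reports these values in KiB
open(my $mi, '<', '/proc/meminfo') or die "meminfo: $!";
while (<$mi>) {
    $free_kib += $1 if /^(?:MemFree|Buffers|Cached):\s+(\d+)/;
}

# On a host without ZFS this file does not exist and nothing is added,
# which is what the nullglob + [a]rcstats trick achieves on the awk side.
if (open(my $arc, '<', '/proc/spl/kstat/zfs/arcstats')) {
    while (<$arc>) {
        if (/^size\s+\d+\s+(\d+)/) {
            $free_kib += int($1 / 1024);    # arcstats "size" is in bytes
            last;
        }
    }
}

print "$free_kib\n";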