From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.weber@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 8F629C1B08
 for <pve-devel@lists.proxmox.com>; Wed, 17 Jan 2024 15:45:55 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 5CB45B18B
 for <pve-devel@lists.proxmox.com>; Wed, 17 Jan 2024 15:45:55 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Wed, 17 Jan 2024 15:45:51 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 66E68457D2
 for <pve-devel@lists.proxmox.com>; Wed, 17 Jan 2024 15:45:51 +0100 (CET)
From: Friedrich Weber <f.weber@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Wed, 17 Jan 2024 15:45:21 +0100
Message-Id: <20240117144521.2960958-1-f.weber@proxmox.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL -0.099 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [proxmox.com]
Subject: [pve-devel] [RFC kernel] cherry-pick scheduler fix to avoid
 temporary VM freezes on NUMA hosts
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Wed, 17 Jan 2024 14:45:55 -0000

Users have been reporting [1] that VMs occasionally become
unresponsive with high CPU usage for some time (ranging from ~1
second to more than 60 seconds). After that time, the guests come
back and continue running. Windows VMs seem most affected (they do
not respond to pings during the hang, and RDP sessions time out), but
we also got reports about Linux VMs (reporting soft lockups). The
issue was not present on host kernel 5.15 and was first reported with
kernel 6.2. Users reported that the issue becomes easier to trigger
the more memory is assigned to the guests. Setting mitigations=off
was reported to alleviate (but not eliminate) the issue. For most
users the issue seems to disappear after (also) disabling KSM [2],
but some users reported freezes even with KSM disabled [3].

It turned out that the reports concerned NUMA hosts only, and that
the freezes correlated with runs of the NUMA balancer [4]. Users
reported that disabling the NUMA balancer (sysctl
kernel.numa_balancing=0) resolves the issue (even with KSM enabled).

We put together a Linux VM reproducer, ran a git bisect on the
kernel to find the commit introducing the issue, and asked upstream
for help [5]. As it turned out, an upstream bug report had recently
been opened [6] and a preliminary fix to the KVM TDP MMU had been
proposed [7]. With that patch [7] on top of kernel 6.7, the
reproducer does not trigger freezes anymore. As of now, the patch (or
its v2 [8]) is not yet merged into the mainline kernel, and
backporting it may be difficult due to dependencies on other KVM
changes [9].

However, the bug report [6] also prompted an upstream developer to
propose a patch to the kernel scheduler logic that decides whether a
contended spinlock/rwlock should be dropped [10]. Without the patch,
PREEMPT_DYNAMIC kernels (such as ours) always drop contended locks.
With the patch, the kernel drops contended locks only if it is
currently set to preempt=full. As noted in the commit message [10],
this can (counter-intuitively) improve KVM performance. Our kernel
defaults to preempt=voluntary (according to
/sys/kernel/debug/sched/preempt), so with the patch it does not drop
contended locks anymore, and the reproducer does not trigger freezes
anymore. Hence, backport [10] to our kernel.
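
To illustrate what the scheduler patch [10] changes, here is a
simplified sketch of the affected locking pattern, loosely modeled on
the KVM mmu_lock scenario described in the commit message below. It
is not the actual KVM code: invalidate_range() and zap_gfn() are made
up for the example.

    /*
     * Simplified sketch, not actual kernel code. Assumes
     * <linux/sched.h> and <linux/kvm_host.h>; invalidate_range() and
     * zap_gfn() are hypothetical. A long-running invalidation holds
     * mmu_lock for write while vCPUs contend for it (for read).
     */
    static void invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)
    {
            gfn_t gfn;

            write_lock(&kvm->mmu_lock);
            for (gfn = start; gfn < end; gfn++) {
                    /*
                     * Without the patch [10], rwlock_needbreak() on a
                     * PREEMPT_DYNAMIC kernel returns true whenever the
                     * lock is contended, even under preempt=voluntary,
                     * so this loop drops mmu_lock as soon as any vCPU
                     * waits for it. With the patch, it yields only
                     * under preempt=full.
                     */
                    if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
                            cond_resched_rwlock_write(&kvm->mmu_lock);
                    zap_gfn(kvm, gfn); /* hypothetical per-page work */
            }
            write_unlock(&kvm->mmu_lock);
    }

As the commit message notes, for some setups letting the invalidation
complete despite contention gives lower latency for all tasks, which
matches the behavior of our reproducer with the patch applied.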

[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-4#post-575886
[3] https://forum.proxmox.com/threads/130727/page-8#post-617587
[4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
[5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[6] https://bugzilla.kernel.org/show_bug.cgi?id=218259
[7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/
[9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/
[10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---

Notes:
    This RFC is not meant to be applied immediately, but is intended to
    sum up the current state of the issue and point out potential fixes.
    
    The patch [10] backported in this RFC hasn't been reviewed upstream
    yet. And while it fixes the reproducer, it is not certain that it
    will fix the freezes seen by users on real-world workloads. Hence,
    it would be desirable to also apply some variant of [7] [8] once it
    is merged upstream; however, backporting it may be difficult, as
    noted above.
    
    So, in any case, for now it might make sense to monitor how
    upstream handles the situation, and then react accordingly. I'll
    continue to participate upstream and send a v2 in due time.

 ...spinlocks-on-contention-iff-kernel-i.patch | 78 +++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch

diff --git a/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
new file mode 100644
index 0000000..932e2f2
--- /dev/null
+++ b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
@@ -0,0 +1,78 @@
+From 39f2bfe0177d3f56c9feac4e70424e4952949e2a Mon Sep 17 00:00:00 2001
+From: Sean Christopherson <seanjc@google.com>
+Date: Wed, 10 Jan 2024 13:47:23 -0800
+Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is
+ preemptible
+
+Use preempt_model_preemptible() to detect a preemptible kernel when
+deciding whether or not to reschedule in order to drop a contended
+spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
+built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
+preemption model is "none" or "voluntary".  In short, make kernels with
+dynamically selected models behave the same as kernels with statically
+selected models.
+
+Somewhat counter-intuitively, NOT yielding a lock can provide better
+latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
+rwlock, is often contended between an invalidation event (takes mmu_lock
+for write) and a vCPU servicing a guest page fault (takes mmu_lock for
+read).  For _some_ setups, letting the invalidation task complete even
+if there is mmu_lock contention provides lower latency for *all* tasks,
+i.e. the invalidation completes sooner *and* the vCPU services the guest
+page fault sooner.
+
+But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
+can vary depending on the host VMM, the guest workload, the number of
+vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
+
+In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
+opposite and removing contention yielding entirely) needs to come with a
+big pile of data proving that changing the status quo is a net positive.
+
+Cc: Valentin Schneider <valentin.schneider@arm.com>
+Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
+Cc: Marco Elver <elver@google.com>
+Cc: Frederic Weisbecker <frederic@kernel.org>
+Cc: David Matlack <dmatlack@google.com>
+Signed-off-by: Sean Christopherson <seanjc@google.com>
+---
+ include/linux/sched.h | 14 ++++++--------
+ 1 file changed, 6 insertions(+), 8 deletions(-)
+
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 292c31697248..a274bc85f222 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -2234,11 +2234,10 @@ static inline bool preempt_model_preemptible(void)
+  */
+ static inline int spin_needbreak(spinlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return spin_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ /*
+@@ -2251,11 +2250,10 @@ static inline int spin_needbreak(spinlock_t *lock)
+  */
+ static inline int rwlock_needbreak(rwlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return rwlock_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ static __always_inline bool need_resched(void)
+-- 
+2.39.2
+
-- 
2.39.2