From mboxrd@z Thu Jan  1 00:00:00 1970
From: Friedrich Weber
To: pve-devel@lists.proxmox.com
Date: Wed, 17 Jan 2024 15:45:21 +0100
Message-Id: <20240117144521.2960958-1-f.weber@proxmox.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [pve-devel] [RFC kernel] cherry-pick scheduler fix to avoid temporary VM freezes on NUMA hosts

Users have been reporting [1] that VMs occasionally become unresponsive with
high CPU usage for some time (varying between ~1 and more than 60 seconds).
After that time, the guests come back and continue running. Windows VMs seem
most affected (not responding to pings during the hang, RDP sessions time
out), but we also got reports about Linux VMs (reporting soft lockups). The
issue was not present on host kernel 5.15 and was first reported with kernel
6.2. Users reported that the issue becomes easier to trigger the more memory
is assigned to the guests. Setting mitigations=off was reported to alleviate
(but not eliminate) the issue. For most users, the issue seems to disappear
after (also) disabling KSM [2], but some users reported freezes even with KSM
disabled [3].

It turned out that the reports concerned NUMA hosts only, and that the
freezes correlated with runs of the NUMA balancer [4]. Users reported that
disabling the NUMA balancer resolves the issue (even with KSM enabled).

We put together a Linux VM reproducer, ran a git-bisect on the kernel to find
the commit introducing the issue, and asked upstream for help [5]. As it
turned out, an upstream bug report had recently been opened [6] and a
preliminary fix to the KVM TDP MMU was proposed [7]. With that patch [7] on
top of kernel 6.7, the reproducer does not trigger freezes anymore. As of
now, the patch (or its v2 [8]) is not yet merged in the mainline kernel, and
backporting it may be difficult due to dependencies on other KVM changes [9].
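For reference, the stopgap workaround mentioned above (disabling the NUMA
balancer, and optionally KSM) can be checked and applied at runtime. This is
an illustrative sketch for affected users, not part of this patch; it assumes
the documented kernel.numa_balancing sysctl [4] and the standard KSM sysfs
interface:

```shell
# Sketch of the stopgap workaround described above (assumes the documented
# kernel.numa_balancing sysctl [4] and the standard KSM sysfs interface;
# not part of this patch).

# Show whether automatic NUMA balancing is enabled (1) or disabled (0):
cat /proc/sys/kernel/numa_balancing 2>/dev/null || echo "numa_balancing not available"

# Disable it at runtime (root required, not persistent across reboots):
#   sysctl -w kernel.numa_balancing=0

# Show the KSM run state (0 = stopped, 1 = running, 2 = stop and unmerge):
cat /sys/kernel/mm/ksm/run 2>/dev/null || echo "ksm not available"
```

Note that per the reports above, disabling the NUMA balancer alone was enough
to avoid the freezes even with KSM left enabled.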
However, the bug report [6] also prompted an upstream developer to propose a
patch to the kernel scheduler logic that decides whether a contended
spinlock/rwlock should be dropped [10]. Without the patch, PREEMPT_DYNAMIC
kernels (such as ours) always drop contended locks. With the patch, the
kernel only drops contended locks if it is currently set to preempt=full. As
noted in the commit message [10], this can (counter-intuitively) improve KVM
performance. Our kernel defaults to preempt=voluntary (according to
/sys/kernel/debug/sched/preempt), so with the patch it does not drop
contended locks anymore, and the reproducer does not trigger freezes anymore.
Hence, backport [10] to our kernel.

[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-4#post-575886
[3] https://forum.proxmox.com/threads/130727/page-8#post-617587
[4] https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
[5] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[6] https://bugzilla.kernel.org/show_bug.cgi?id=218259
[7] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@google.com/
[9] https://lore.kernel.org/kvm/Zaa654hwFKba_7pf@google.com/
[10] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com/

Signed-off-by: Friedrich Weber
---

Notes:
    This RFC is not meant to be applied immediately, but is intended to sum
    up the current state of the issue and point out potential fixes.

    The patch [10] backported in this RFC hasn't been reviewed upstream yet.
    And while it fixes the reproducer, it is not certain that it will fix
    freezes seen by users on real-world workloads. Hence, it would be
    desirable to also apply some variant of [7] [8] once it is applied
    upstream; however, there may be difficulties backporting it, as noted
    above.
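    As a side note for anyone verifying this on a host: on PREEMPT_DYNAMIC
    kernels the live preemption model can be read from
    /sys/kernel/debug/sched/preempt, where the active model is marked in
    parentheses. A small illustrative sketch (the sample string below stands
    in for the debugfs file, which requires root to read):

```shell
# Print the live preemption model of a PREEMPT_DYNAMIC kernel.
# /sys/kernel/debug/sched/preempt lists all models and marks the active
# one in parentheses, e.g. "none (voluntary) full" (debugfs, root only).
current_preempt_model() {
    # Keep only the word wrapped in parentheses.
    sed -n 's/.*(\([a-z]*\)).*/\1/p'
}

# On a real host: current_preempt_model < /sys/kernel/debug/sched/preempt
# Here a sample string stands in for the file:
echo "none (voluntary) full" | current_preempt_model   # prints: voluntary
```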
    So, in any case, for now it might make sense to monitor how upstream
    handles the situation, and then react accordingly. I'll continue to
    participate upstream and send a v2 in due time.

 ...spinlocks-on-contention-iff-kernel-i.patch | 78 +++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch

diff --git a/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
new file mode 100644
index 0000000..932e2f2
--- /dev/null
+++ b/patches/kernel/0018-sched-core-Drop-spinlocks-on-contention-iff-kernel-i.patch
@@ -0,0 +1,78 @@
+From 39f2bfe0177d3f56c9feac4e70424e4952949e2a Mon Sep 17 00:00:00 2001
+From: Sean Christopherson
+Date: Wed, 10 Jan 2024 13:47:23 -0800
+Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is
+ preemptible
+
+Use preempt_model_preemptible() to detect a preemptible kernel when
+deciding whether or not to reschedule in order to drop a contended
+spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
+built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
+preemption model is "none" or "voluntary". In short, make kernels with
+dynamically selected models behave the same as kernels with statically
+selected models.
+
+Somewhat counter-intuitively, NOT yielding a lock can provide better
+latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
+rwlock, is often contended between an invalidation event (takes mmu_lock
+for write) and a vCPU servicing a guest page fault (takes mmu_lock for
+read). For _some_ setups, letting the invalidation task complete even
+if there is mmu_lock contention provides lower latency for *all* tasks,
+i.e. the invalidation completes sooner *and* the vCPU services the guest
+page fault sooner.
+
+But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
+can vary depending on the host VMM, the guest workload, the number of
+vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
+
+In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
+opposite and removing contention yielding entirely) needs to come with a
+big pile of data proving that changing the status quo is a net positive.
+
+Cc: Valentin Schneider
+Cc: Peter Zijlstra (Intel)
+Cc: Marco Elver
+Cc: Frederic Weisbecker
+Cc: David Matlack
+Signed-off-by: Sean Christopherson
+---
+ include/linux/sched.h | 14 ++++++--------
+ 1 file changed, 6 insertions(+), 8 deletions(-)
+
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 292c31697248..a274bc85f222 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -2234,11 +2234,10 @@ static inline bool preempt_model_preemptible(void)
+  */
+ static inline int spin_needbreak(spinlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return spin_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ /*
+@@ -2251,11 +2250,10 @@ static inline int spin_needbreak(spinlock_t *lock)
+  */
+ static inline int rwlock_needbreak(rwlock_t *lock)
+ {
+-#ifdef CONFIG_PREEMPTION
++	if (!preempt_model_preemptible())
++		return 0;
++
+ 	return rwlock_is_contended(lock);
+-#else
+-	return 0;
+-#endif
+ }
+ 
+ static __always_inline bool need_resched(void)
+-- 
+2.39.2
+
-- 
2.39.2