From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel 2/2] cherry-pick fix for uefi guests hanging upon guest-initialized reboot
Date: Thu, 24 Aug 2023 16:30:21 +0200 [thread overview]
Message-ID: <20230824143021.2440581-3-s.ivanov@proxmox.com> (raw)
In-Reply-To: <20230824143021.2440581-1-s.ivanov@proxmox.com>
This was identified as a potential fix for an issue we analyzed in our
Enterprise support, where guests would hang before the boot-loader
after being rebooted from within the guest (after applying updates for
RHEL 8).
https://lore.kernel.org/lkml/20230608090348.414990-1-gshan@redhat.com/
Suggested-by: Stefan Hanreich <s.hanreich@proxmox.com>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
---
...l-stage2-mapping-on-invalid-memory-s.patch | 122 ++++++++++++++++++
1 file changed, 122 insertions(+)
create mode 100644 patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
diff --git a/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
new file mode 100644
index 000000000000..d50aab8e4d7c
--- /dev/null
+++ b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
@@ -0,0 +1,122 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Gavin Shan <gshan@redhat.com>
+Date: Thu, 15 Jun 2023 15:42:59 +1000
+Subject: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
+
+commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1 upstream.
+
+We run into guest hang in edk2 firmware when KSM is kept as running on
+the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
+device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
+buffered write. The status is returned by reading the memory region of
+the pflash device and the read request should have been forwarded to QEMU
+and emulated by it. Unfortunately, the read request is covered by an
+illegal stage2 mapping when the guest hang issue occurs. The read request
+is completed with QEMU bypassed and wrong status is fetched. The edk2
+firmware runs into an infinite loop with the wrong status.
+
+The illegal stage2 mapping is populated due to same page sharing by KSM
+at (C) even the associated memory slot has been marked as invalid at (B)
+when the memory slot is requested to be deleted. It's notable that the
+active and inactive memory slots can't be swapped when we're in the middle
+of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
+is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
+to zero again. Besides, the swapping from the active to the inactive memory
+slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
+corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
+
+ CPU-A CPU-B
+ ----- -----
+ ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
+ kvm_vm_ioctl_set_memory_region
+ kvm_set_memory_region
+ __kvm_set_memory_region
+ kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
+ kvm_invalidate_memslot
+ kvm_copy_memslot
+ kvm_replace_memslot
+ kvm_swap_active_memslots (A)
+ kvm_arch_flush_shadow_memslot (B)
+ same page sharing by KSM
+ kvm_mmu_notifier_invalidate_range_start
+ :
+ kvm_mmu_notifier_change_pte
+ kvm_handle_hva_range
+ __kvm_handle_hva_range
+ kvm_set_spte_gfn (C)
+ :
+ kvm_mmu_notifier_invalidate_range_end
+
+Fix the issue by skipping the invalid memory slot at (C) to avoid the
+illegal stage2 mapping so that the read request for the pflash's status
+is forwarded to QEMU and emulated by it. In this way, the correct pflash's
+status can be returned from QEMU to break the infinite loop in the edk2
+firmware.
+
+We tried a git-bisect and the first problematic commit is cd4c71835228 ("
+KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
+clean_dcache_guest_page() is called after the memory slots are iterated
+in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
+before the iteration on the memory slots before this commit. This change
+literally enlarges the racy window between kvm_mmu_notifier_change_pte()
+and memory slot removal so that we're able to reproduce the issue in a
+practical test case. However, the issue exists since commit d5d8184d35c9
+("KVM: ARM: Memory virtualization setup").
+
+Cc: stable@vger.kernel.org # v3.9+
+Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
+Reported-by: Shuai Hu <hshuai@redhat.com>
+Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
+Signed-off-by: Gavin Shan <gshan@redhat.com>
+Reviewed-by: David Hildenbrand <david@redhat.com>
+Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
+Reviewed-by: Peter Xu <peterx@redhat.com>
+Reviewed-by: Sean Christopherson <seanjc@google.com>
+Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
+Message-Id: <20230615054259.14911-1-gshan@redhat.com>
+Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+(cherry picked from commit 953dd7e2df8181d5ce4117fca347992d616f0621)
+Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
+---
+ virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
+ 1 file changed, 19 insertions(+), 1 deletion(-)
+
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index db159be9d5b8..6deb43c2d091 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -636,6 +636,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
+
+ return __kvm_handle_hva_range(kvm, &range);
+ }
++
++static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
++{
++ /*
++ * Skipping invalid memslots is correct if and only change_pte() is
++ * surrounded by invalidate_range_{start,end}(), which is currently
++ * guaranteed by the primary MMU. If that ever changes, KVM needs to
++ * unmap the memslot instead of skipping the memslot to ensure that KVM
++ * doesn't hold references to the old PFN.
++ */
++ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
++
++ if (range->slot->flags & KVM_MEMSLOT_INVALID)
++ return false;
++
++ return kvm_set_spte_gfn(kvm, range);
++}
++
+ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address,
+@@ -656,7 +674,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ if (!READ_ONCE(kvm->mmu_notifier_count))
+ return;
+
+- kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
++ kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
+ }
+
+ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
--
2.39.2
next prev parent reply other threads:[~2023-08-24 14:31 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-24 14:30 [pve-devel] [PATCH pve-kernel 0/2] cherry-pick a patch from kernel.org stable 5.15 for guests hanging during reboot Stoiko Ivanov
2023-08-24 14:30 ` [pve-devel] [PATCH pve-kernel 1/2] refresh patches after ./debian/scripts/export-patchqueue Stoiko Ivanov
2023-08-24 14:30 ` Stoiko Ivanov [this message]
2023-08-25 7:35 ` [pve-devel] [PATCH pve-kernel 2/2] cherry-pick fix for uefi guests hanging upon guest-initialized reboot Fiona Ebner
2023-08-25 7:40 ` Stefan Hanreich
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230824143021.2440581-3-s.ivanov@proxmox.com \
--to=s.ivanov@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.