From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 27A9DD4E2 for ; Fri, 14 Jul 2023 20:31:13 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id F3DE6328B9 for ; Fri, 14 Jul 2023 20:30:42 +0200 (CEST) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS for ; Fri, 14 Jul 2023 20:30:41 +0200 (CEST) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 90B3742023 for ; Fri, 14 Jul 2023 20:30:41 +0200 (CEST) From: Thomas Lamprecht To: pve-devel@lists.proxmox.com Date: Fri, 14 Jul 2023 20:30:34 +0200 Message-Id: <20230714183034.866091-1-t.lamprecht@proxmox.com> X-Mailer: git-send-email 2.39.2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-SPAM-LEVEL: Spam detection results: 0 AWL -0.077 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record T_SCC_BODY_TEXT_LINE -0.01 - Subject: [pve-devel] applied: [PATCH] kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no support X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Jul 2023 18:31:13 -0000 Fixes live-migrations & snapshot-rollback of VMs with a restricted CPU type (e.g., qemu64) from our 5.15 based kernel (default Proxmox VE 7.4) to the 6.2 (and future newer) of Proxmox VE 8.0. Previous to (upstream kernel) commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") the PKRU bit of the host could leak into the state from the guest, which caused trouble when migrating between hosts with different CPUs, i.e., where the source supported it but the target did not, causing a general protection fault when the guest tried to use a pkru related instruction after the migration. But the fix, while welcome, caused a temporary out-of-sync state when migrating such a VM from a kernel without the fix to a kernel with the fix, as it threw of KVM when the CPUID of the guest and most of the state doesn't report XSAVE and thus any xfeatures, but PKRU and the related state is set as enabled, causing the vCPU to spin at 100% without any progress forever. The fix could be at two sites, either in QEMU or in the kernel, I choose the kernel as we have all the info there for a targeted heuristic so that we don't have to adapt QEMU and qemu-server, the latter even on both sides. Still, a short summary of the possible fixes and short drawbacks: * on QEMU-side either - clear the PKRU state in the migration saved state would be rather complicated to implement as the vCPU is initialised way before we have the saved xfeature state available to check what we'd need to do, plus the user-space only gets a memory blob from ioctl KVM_GET_XSAVE2 that it passes to KVM_SET_XSAVE ioctl, there are no ABI guarantees, and while the struct seem stable for 5.15 to 6.5-rc1, that doesn't has to be for future kernels, so off the table. - enforce that the CPUID reports PKU support even if it normally wouldn't. While this works (tested by hard-coding it as POC) it is a) not really nice and b) needs some interaction from qemu-server to enable this flag as otherwise we have no good info to decide when it's OK to do this, which means we need to adapt both PVE 7 and 8's qemu-server and also pve-qemu, workable but not optimal * on Kernel/KVM-side we can hook into the set XSAVE ioctl specific to the KVM subsystem, which already reduces chance of regression for all other places. There we have access to the union/struct definitions of the saved state and thus can savely cast to that. We also got access to the vCPU's CPUID capabilities, meaning we can check if the XCR0 (first XSAVE Control Register) reports that it support the PKRU feature, and if it does *NOT* but the saved xfeatures register from XSAVE *DOES* report it, we can safely assume that this combination is due to an migration from an older, leaky kernel – and clear the bit in the xfeature register before restoring it to the guest vCPU KVM state, avoiding the confusing situation that made the vCPU spin at 100%. This should be safe to do, as the guest vCPU CPUID never reported support for the PKRU feature, and it's also a relatively niche and newish feature. If it gains us something we can drop this patch a bit in the future Proxmox VE 9 major release, but we should ensure that VMs that where started before PVE 8 cannot be directly live-migrated to the release that includes that change; so we should rather only drop it if the maintenance burden is high. Signed-off-by: Thomas Lamprecht --- still need to evaluate if this could be (made) fit for upstream (mainline and/or ubuntu). ...sk-out-PKRU-bit-in-xfeatures-if-vCPU.patch | 133 ++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 patches/kernel/0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch diff --git a/patches/kernel/0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch b/patches/kernel/0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch new file mode 100644 index 0000000..dda75b8 --- /dev/null +++ b/patches/kernel/0008-kvm-xsave-set-mask-out-PKRU-bit-in-xfeatures-if-vCPU.patch @@ -0,0 +1,133 @@ +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 +From: Thomas Lamprecht +Date: Fri, 14 Jul 2023 18:10:32 +0200 +Subject: [PATCH] kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no + support +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Fixes live-migrations & snapshot-rollback of VMs with a restricted +CPU type (e.g., qemu64) from our 5.15 based kernel (default Proxmox +VE 7.4) to the 6.2 (and future newer) of Proxmox VE 8.0. + +Previous to ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to +supported bits of XCR0") the PKRU bit of the host could leak into the +state from the guest, which caused trouble when migrating between +hosts with different CPUs, i.e., where the source supported it but +the target did not, causing a general protection fault when the guest +tried to use a pkru related instruction after the migration. + +But the fix, while welcome, caused a temporary out-of-sync state when +migrating such a VM from a kernel without the fix to a kernel with +the fix, as it threw of KVM when the CPUID of the guest and most of +the state doesn't report XSAVE and thus any xfeatures, but PKRU and +the related state is set as enabled, causing the vCPU to spin at 100% +without any progress forever. + +The fix could be at two sites, either in QEMU or in the kernel, I +choose the kernel as we have all the info there for a targeted +heuristic so that we don't have to adapt QEMU and qemu-server, the +latter even on both sides. + +Still, a short summary of the possible fixes and short drawbacks: +* on QEMU-side either + - clear the PKRU state in the migration saved state would be rather + complicated to implement as the vCPU is initialised way before we + have the saved xfeature state available to check what we'd need + to do, plus the user-space only gets a memory blob from ioctl + KVM_GET_XSAVE2 that it passes to KVM_SET_XSAVE ioctl, there are + no ABI guarantees, and while the struct seem stable for 5.15 to + 6.5-rc1, that doesn't has to be for future kernels, so off the + table. + - enforce that the CPUID reports PKU support even if it normally + wouldn't. While this works (tested by hard-coding it as POC) it + is a) not really nice and b) needs some interaction from + qemu-server to enable this flag as otherwise we have no good info + to decide when it's OK to do this, which means we need to adapt + both PVE 7 and 8's qemu-server and also pve-qemu, workable but + not optimal + +* on Kernel/KVM-side we can hook into the set XSAVE ioctl specific to + the KVM subsystem, which already reduces chance of regression for + all other places. There we have access to the union/struct + definitions of the saved state and thus can savely cast to that. + We also got access to the vCPU's CPUID capabilities, meaning we can + check if the XCR0 (first XSAVE Control Register) reports + that it support the PKRU feature, and if it does *NOT* but the + saved xfeatures register from XSAVE *DOES* report it, we can safely + assume that this combination is due to an migration from an older, + leaky kernel – and clear the bit in the xfeature register before + restoring it to the guest vCPU KVM state, avoiding the confusing + situation that made the vCPU spin at 100%. + This should be safe to do, as the guest vCPU CPUID never reported + support for the PKRU feature, and it's also a relatively niche and + newish feature. + +If it gains us something we can drop this patch a bit in the future +Proxmox VE 9 major release, but we should ensure that VMs that where +started before PVE 8 cannot be directly live-migrated to the release +that includes that change; so we should rather only drop it if the +maintenance burden is high. + +Signed-off-by: Thomas Lamprecht +--- + arch/x86/kvm/cpuid.c | 6 ++++++ + arch/x86/kvm/cpuid.h | 2 ++ + arch/x86/kvm/x86.c | 13 +++++++++++++ + 3 files changed, 21 insertions(+) + +diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c +index 596061c1610e..9cbf12eca1d9 100644 +--- a/arch/x86/kvm/cpuid.c ++++ b/arch/x86/kvm/cpuid.c +@@ -251,6 +251,12 @@ static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent) + return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0; + } + ++bool vcpu_supports_xsave_pkru(struct kvm_vcpu *vcpu) { ++ u64 guest_supported_xcr0 = cpuid_get_supported_xcr0( ++ vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent); ++ return (guest_supported_xcr0 & XFEATURE_MASK_PKRU) != 0; ++} ++ + static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries, + int nent) + { +diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h +index b1658c0de847..12a02851ff57 100644 +--- a/arch/x86/kvm/cpuid.h ++++ b/arch/x86/kvm/cpuid.h +@@ -32,6 +32,8 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu, + bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx, + u32 *ecx, u32 *edx, bool exact_only); + ++bool vcpu_supports_xsave_pkru(struct kvm_vcpu *vcpu); ++ + u32 xstate_required_size(u64 xstate_bv, bool compacted); + + int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu); +diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c +index 327890b2e0c5..1c5775d51495 100644 +--- a/arch/x86/kvm/x86.c ++++ b/arch/x86/kvm/x86.c +@@ -5330,6 +5330,19 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, + if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) + return 0; + ++ if (!vcpu_supports_xsave_pkru(vcpu)) { ++ void *buf = guest_xsave->region; ++ union fpregs_state *ustate = buf; ++ if (ustate->xsave.header.xfeatures & XFEATURE_MASK_PKRU) { ++ printk( ++ KERN_NOTICE "clearing PKRU xfeature bit as vCPU from PID %d" ++ " reports no PKRU support - migration from fpu-leaky kernel?", ++ current->pid ++ ); ++ ustate->xsave.header.xfeatures &= ~XFEATURE_MASK_PKRU; ++ } ++ } ++ + return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, + guest_xsave->region, + kvm_caps.supported_xcr0, -- 2.39.2