From: Fiona Ebner
To: pve-devel@lists.proxmox.com
Date: Thu, 19 Dec 2024 15:29:47 +0100
Message-Id: <20241219142947.1447728-5-f.ebner@proxmox.com>
In-Reply-To: <20241219142947.1447728-1-f.ebner@proxmox.com>
References: <20241219142947.1447728-1-f.ebner@proxmox.com>
Subject: [pve-devel] [PATCH bookworm-6.8 kernel 4/4] backport fix for kvm performance regression with Intel Emerald Rapids

Adapted to a context change in "arch/x86/kvm/cpuid.h", caused by the
vcpu_supports_xsave_pkru() function added by the Proxmox VE downstream
patch "kvm: xsave set: mask-out PKRU bit in xfeatures if vCPU has no
support". Otherwise this is a clean cherry-pick from linux-next, with
no functional changes.

Signed-off-by: Fiona Ebner
---
 ...UID.0xD-XSTATE-offsets-sizes-during-.patch | 165 ++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 patches/kernel/0019-KVM-x86-Cache-CPUID.0xD-XSTATE-offsets-sizes-during-.patch

diff --git a/patches/kernel/0019-KVM-x86-Cache-CPUID.0xD-XSTATE-offsets-sizes-during-.patch b/patches/kernel/0019-KVM-x86-Cache-CPUID.0xD-XSTATE-offsets-sizes-during-.patch
new file mode 100644
index 0000000..beda78e
--- /dev/null
+++ b/patches/kernel/0019-KVM-x86-Cache-CPUID.0xD-XSTATE-offsets-sizes-during-.patch
@@ -0,0 +1,165 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Sean Christopherson
+Date: Tue, 10 Dec 2024 17:32:58 -0800
+Subject: [PATCH] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module
+ init
+
+Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to
+avoid the overhead of CPUID during runtime. The offset, size, and
+metadata for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS
+values, i.e.
+are constant for a given CPU, and thus can be cached during module
+load.
+
+On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point
+where recomputing XSAVE offsets and sizes results in a 4x increase in
+latency of nested VM-Enter and VM-Exit (nested transitions can trigger
+xstate_required_size() multiple times per transition), relative to
+using cached values. The issue is easily visible by running `perf top`
+while triggering nested transitions: kvm_update_cpuid_runtime() shows
+up at a whopping 50%.
+
+As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit
+test and a slightly modified L1 KVM to handle CPUID in the fastpath),
+a nested roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX),
+and Emerald Rapids (EMR) takes:
+
+  SKX 11650
+  ICX 22350
+  EMR 28850
+
+Using cached values, the latency drops to:
+
+  SKX  6850
+  ICX  9000
+  EMR  7900
+
+The underlying issue is that CPUID itself is slow on ICX, and comically
+slow on EMR. The problem is exacerbated on CPUs which support XSAVES
+and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
+runtime CPUID update, and because there are more supported XSAVE
+features (CPUID for supported XSAVE feature sub-leaves is
+significantly slower).
+
+  SKX:
+  CPUID.0xD.2  = 348 cycles
+  CPUID.0xD.3  = 400 cycles
+  CPUID.0xD.4  = 276 cycles
+  CPUID.0xD.5  = 236 cycles
+
+  EMR:
+  CPUID.0xD.2  = 1138 cycles
+  CPUID.0xD.3  = 1362 cycles
+  CPUID.0xD.4  = 1068 cycles
+  CPUID.0xD.5  = 910 cycles
+  CPUID.0xD.6  = 914 cycles
+  CPUID.0xD.7  = 1350 cycles
+  CPUID.0xD.8  = 734 cycles
+  CPUID.0xD.9  = 766 cycles
+  CPUID.0xD.10 = 732 cycles
+  CPUID.0xD.11 = 718 cycles
+  CPUID.0xD.12 = 734 cycles
+  CPUID.0xD.13 = 1700 cycles
+  CPUID.0xD.14 = 1126 cycles
+  CPUID.0xD.15 = 898 cycles
+  CPUID.0xD.16 = 716 cycles
+  CPUID.0xD.17 = 748 cycles
+  CPUID.0xD.18 = 776 cycles
+
+Note, updating runtime CPUID information multiple times per nested
+transition is itself a flaw, especially since CPUID is a mandatory
+intercept on both Intel and AMD. E.g. KVM doesn't need to ensure
+emulated CPUID state is up-to-date while running L2. That flaw will be
+fixed in a future patch, as deferring runtime CPUID updates is more
+subtle than it appears at first glance, the benefits aren't critical
+once the XSAVE issue is resolved, and caching CPUID output is
+desirable even if KVM's updates are deferred.
+
+Cc: Jim Mattson
+Cc: stable@vger.kernel.org
+Signed-off-by: Sean Christopherson
+Message-ID: <20241211013302.1347853-2-seanjc@google.com>
+Signed-off-by: Paolo Bonzini
+(cherry picked from commit 1201f226c863b7da739f7420ddba818cedf372fc)
+Signed-off-by: Fiona Ebner
+---
+ arch/x86/kvm/cpuid.c | 31 ++++++++++++++++++++++++++-----
+ arch/x86/kvm/cpuid.h |  1 +
+ arch/x86/kvm/x86.c   |  2 ++
+ 3 files changed, 29 insertions(+), 5 deletions(-)
+
+diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
+index d68c04bde5ededcd2ffb24300646f4a9d9c1bc9a..be6e9995fb6dcd84aa21ff7a324b585d0a4a008f 100644
+--- a/arch/x86/kvm/cpuid.c
++++ b/arch/x86/kvm/cpuid.c
+@@ -36,6 +36,26 @@
+ u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
+ EXPORT_SYMBOL_GPL(kvm_cpu_caps);
+ 
++struct cpuid_xstate_sizes {
++	u32 eax;
++	u32 ebx;
++	u32 ecx;
++};
++
++static struct cpuid_xstate_sizes xstate_sizes[XFEATURE_MAX] __ro_after_init;
++
++void __init kvm_init_xstate_sizes(void)
++{
++	u32 ign;
++	int i;
++
++	for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes); i++) {
++		struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
++
++		cpuid_count(0xD, i, &xs->eax, &xs->ebx, &xs->ecx, &ign);
++	}
++}
++
+ u32 xstate_required_size(u64 xstate_bv, bool compacted)
+ {
+ 	int feature_bit = 0;
+@@ -44,14 +64,15 @@ u32 xstate_required_size(u64 xstate_bv, bool compacted)
+ 	xstate_bv &= XFEATURE_MASK_EXTEND;
+ 	while (xstate_bv) {
+ 		if (xstate_bv & 0x1) {
+-			u32 eax, ebx, ecx, edx, offset;
+-			cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
++			struct cpuid_xstate_sizes *xs = &xstate_sizes[feature_bit];
++			u32 offset;
++
+ 			/* ECX[1]: 64B alignment in compacted form */
+ 			if (compacted)
+-				offset = (ecx & 0x2) ? ALIGN(ret, 64) : ret;
++				offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
+ 			else
+-				offset = ebx;
+-			ret = max(ret, offset + eax);
++				offset = xs->ebx;
++			ret = max(ret, offset + xs->eax);
+ 		}
+ 
+ 		xstate_bv >>= 1;
+diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
+index 07da153802e4d4f75219d77596e323347393eab7..a50b57e5d40073027901249a355b65b9a8b54a74 100644
+--- a/arch/x86/kvm/cpuid.h
++++ b/arch/x86/kvm/cpuid.h
+@@ -34,6 +34,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
+ 
+ bool vcpu_supports_xsave_pkru(struct kvm_vcpu *vcpu);
+ 
++void __init kvm_init_xstate_sizes(void);
+ u32 xstate_required_size(u64 xstate_bv, bool compacted);
+ 
+ int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
+diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
+index 08ef6f01bf7e8c9c6cdb842f23adac09195c1aff..558e267ae0def34227f6f854a0955a26766a6e2c 100644
+--- a/arch/x86/kvm/x86.c
++++ b/arch/x86/kvm/x86.c
+@@ -13930,6 +13930,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+ 
+ static int __init kvm_x86_init(void)
+ {
++	kvm_init_xstate_sizes();
++
+ 	kvm_mmu_x86_module_init();
+ 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
+ 	return 0;
-- 
2.39.5
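
As background for the technique above: the patch replaces a per-call
CPUID with a one-time snapshot taken at module load, which is valid
because CPUID.0xD.[1..n] output is constant for a given CPU. Below is a
minimal userspace sketch of the same cache-then-compute pattern,
loosely mirroring kvm_init_xstate_sizes() and xstate_required_size().
It is an illustration, not kernel code: XFEATURE_MAX, the 576-byte base
(512-byte legacy XSAVE area plus 64-byte header), and main() are
assumptions made to keep the example self-contained, and
__get_cpuid_count() is the GCC/clang <cpuid.h> helper, not the kernel's
cpuid_count().

/*
 * Sketch: snapshot CPUID.0xD.[2..n] once at startup, then compute the
 * XSAVE buffer size for a feature mask from the cached values instead
 * of executing CPUID on every call. Compile with: gcc -O2 xstate_demo.c
 */
#include <cpuid.h>      /* __get_cpuid_count() */
#include <stdint.h>
#include <stdio.h>

#define XFEATURE_MAX 19                     /* sub-leaves 0..18, as in the patch */
#define ALIGN64(x)   (((x) + 63u) & ~63u)   /* 64-byte alignment for compacted form */

struct xstate_size { uint32_t eax, ebx, ecx; };

static struct xstate_size xstate_sizes[XFEATURE_MAX];

/* One-time snapshot, the analogue of kvm_init_xstate_sizes(). */
static void init_xstate_sizes(void)
{
	unsigned int eax, ebx, ecx, edx;

	for (int i = 2; i < XFEATURE_MAX; i++) {
		if (__get_cpuid_count(0xD, i, &eax, &ebx, &ecx, &edx)) {
			xstate_sizes[i].eax = eax;  /* size of this state component */
			xstate_sizes[i].ebx = ebx;  /* offset in standard format */
			xstate_sizes[i].ecx = ecx;  /* bit 1: 64B alignment (compacted) */
		}
	}
}

/* The analogue of xstate_required_size(), reading only cached values. */
static uint32_t required_size(uint64_t xstate_bv, int compacted)
{
	uint32_t ret = 512 + 64;  /* legacy XSAVE area + XSAVE header */
	int bit = 2;

	xstate_bv >>= 2;          /* x87/SSE state lives in the legacy area */
	while (xstate_bv && bit < XFEATURE_MAX) {
		if (xstate_bv & 1) {
			const struct xstate_size *xs = &xstate_sizes[bit];
			uint32_t offset = compacted
				? ((xs->ecx & 0x2) ? ALIGN64(ret) : ret)
				: xs->ebx;
			if (offset + xs->eax > ret)
				ret = offset + xs->eax;
		}
		xstate_bv >>= 1;
		bit++;
	}
	return ret;
}

int main(void)
{
	init_xstate_sizes();
	/* 0x7 = x87 | SSE | AVX; the AVX component comes from the cache,
	 * not from a fresh CPUID execution. */
	printf("XSAVE size for xcr0=0x7: %u bytes (standard format)\n",
	       (unsigned)required_size(0x7, 0));
	return 0;
}

The design point the patch relies on is visible here: nothing in the
size computation depends on runtime state, so every CPUID execution can
be hoisted to init time.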
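The cycle counts quoted in the commit message were gathered via RDTSC
from inside an L2 guest. The sketch below shows the same style of
measurement from plain userspace; absolute numbers will be far smaller
than the guest-side figures since no nested VM-Exit roundtrip is
involved, and the 1000-iteration minimum is an arbitrary
noise-suppression choice, so treat this as illustrating the methodology
only.

/*
 * Rough userspace timing of CPUID.0xD sub-leaves via RDTSC. CPUID is
 * itself serializing, so no extra fencing is added around the timed
 * region. Compile with: gcc -O2 cpuid_cost.c
 */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc() */

static volatile unsigned int sink;  /* keeps CPUID results observable */

static uint64_t time_cpuid_subleaf(unsigned int subleaf)
{
	unsigned int a, b, c, d;
	uint64_t best = UINT64_MAX;

	/* Take the minimum over many runs to suppress interrupt noise. */
	for (int i = 0; i < 1000; i++) {
		uint64_t t0 = __rdtsc();
		__get_cpuid_count(0xD, subleaf, &a, &b, &c, &d);
		uint64_t t1 = __rdtsc();

		sink = a ^ b ^ c ^ d;  /* prevent the CPUID from being optimized out */
		if (t1 - t0 < best)
			best = t1 - t0;
	}
	return best;
}

int main(void)
{
	/* Sub-leaves 2..18, matching the EMR table in the commit message. */
	for (unsigned int i = 2; i <= 18; i++)
		printf("CPUID.0xD.%-2u = %llu cycles\n", i,
		       (unsigned long long)time_cpuid_subleaf(i));
	return 0;
}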