From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 009331FF18C
	for ; Tue, 14 Apr 2026 23:16:43 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 8E1E0220CA;
	Tue, 14 Apr 2026 23:16:42 +0200 (CEST)
From: Thomas Lamprecht
To: pve-devel@lists.proxmox.com
Subject: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
Date: Tue, 14 Apr 2026 23:09:56 +0200
Message-ID: <20260414211600.4023940-1-t.lamprecht@proxmox.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
List-Id: Proxmox VE development discussion

Based on Jens Axboe's succinct reply [0] on the kernel side:

> It's not "IO pressure", it's the useless iowait metric...
> [...]
> If you won't want it, just turn it off with io_uring_set_iowait().

[0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/

Signed-off-by: Thomas Lamprecht
---

Might need some closer checking, but seems to work out OK.
Thanks @Fiona for some great off-list input w.r.t. this.

If it works out, I might also submit this upstream, but I would first need
to recheck the current status upstream @qemu (tbh. I'd be a bit surprised
if nobody has run into this and gotten annoyed by it yet).

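For review purposes, the idle check from the patch can be read in isolation roughly as below. This is only a sketch and not part of the patch: the `sketch_*` names are hypothetical, the flag values mirror the compat defines in the diff (plus `IORING_ENTER_GETEVENTS` from the io_uring UAPI), and the "-1 means block indefinitely" convention for `timeout` is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the io_uring UAPI and the compat defines in the patch. */
#define SKETCH_ENTER_GETEVENTS (1U << 0)
#define SKETCH_ENTER_NO_IOWAIT (1U << 7)
#define SKETCH_FEAT_NO_IOWAIT  (1U << 17)

/*
 * Hypothetical helper: the loop counts as idle when the kernel advertises
 * NO_IOWAIT support and the only queued SQE (if any) is the timeout itself.
 */
static bool sketch_loop_is_idle(unsigned features, unsigned sq_ready,
                                int64_t timeout)
{
    /* Assumption for this sketch: -1 means "block indefinitely". */
    assert(timeout >= -1);

    /* A positive timeout queues exactly one timeout SQE, otherwise none. */
    unsigned expected = timeout > 0 ? 1 : 0;

    return (features & SKETCH_FEAT_NO_IOWAIT) && sq_ready == expected;
}

/* Hypothetical helper: flags that would be passed to io_uring_enter(). */
static unsigned sketch_enter_flags(unsigned features, unsigned sq_ready,
                                   int64_t timeout)
{
    unsigned flags = SKETCH_ENTER_GETEVENTS;

    if (sketch_loop_is_idle(features, sq_ready, timeout)) {
        flags |= SKETCH_ENTER_NO_IOWAIT;
    }
    return flags;
}
```

With the feature bit set, one pending SQE and a positive timeout, the sketch adds the NO_IOWAIT flag. Any additional pending SQE, or a kernel without the feature bit, leaves only GETEVENTS set, so the wait is accounted as before.
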
 ...void-idle-event-loop-being-accounted.patch | 89 +++++++++++++++++++
 debian/patches/series                         |  1 +
 2 files changed, 90 insertions(+)
 create mode 100644 debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch

diff --git a/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
new file mode 100644
index 0000000..b23cd50
--- /dev/null
+++ b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
@@ -0,0 +1,89 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Thomas Lamprecht
+Date: Wed, 1 Apr 2026 21:08:28 +0200
+Subject: [PATCH] fdmon-io_uring: avoid idle event loop being accounted as IO
+ wait
+
+Since QEMU 10.2, iothreads use io_uring for file descriptor monitoring
+when available at build time. The blocking wait in io_uring_enter()
+gets accounted as IO wait by the kernel, causing nearly 100% IO
+pressure in PSI even when the system is idle.
+
+Split io_uring_submit_and_wait() into separate submit and wait calls
+so the IORING_ENTER_NO_IOWAIT flag can be passed to io_uring_enter().
+Only set the flag when the timeout is the sole pending SQE, i.e., the
+event loop is truly idle with no block layer IO or poll re-arms
+pending. The flag is gated on IORING_FEAT_NO_IOWAIT so older kernels
+are unaffected.
+
+Signed-off-by: Thomas Lamprecht
+---
+ util/fdmon-io_uring.c | 34 +++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
+
+diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
+index d0b56127c6..1d24c2dd5d 100644
+--- a/util/fdmon-io_uring.c
++++ b/util/fdmon-io_uring.c
+@@ -51,6 +51,15 @@
+ #include "aio-posix.h"
+ #include "trace.h"
+ 
++/* Compat defines for liburing versions that lack NO_IOWAIT support */
++#ifndef IORING_ENTER_NO_IOWAIT
++#define IORING_ENTER_NO_IOWAIT (1U << 7)
++#endif
++
++#ifndef IORING_FEAT_NO_IOWAIT
++#define IORING_FEAT_NO_IOWAIT (1U << 17)
++#endif
++
+ enum {
+     FDMON_IO_URING_ENTRIES = 128, /* sq/cq ring size */
+ 
+@@ -385,6 +394,7 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ {
+     struct __kernel_timespec ts;
+     unsigned wait_nr = 1; /* block until at least one cqe is ready */
++    bool no_iowait;
+     int ret;
+ 
+     if (timeout == 0) {
+@@ -405,14 +415,36 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ 
+     fill_sq_ring(ctx);
+ 
++    /*
++     * When only the timeout SQE is pending (or none at all), the event loop
++     * is idle and the blocking wait should not be accounted as IO wait in PSI.
++     */
++    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
++                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
++                    (timeout > 0 ? 1 : 0);
++
+     /*
+      * Loop to handle signals in both cases:
+      * 1. If no SQEs were submitted, then -EINTR is returned.
+      * 2. If SQEs were submitted then the number of SQEs submitted is returned
+      *    rather than -EINTR.
++     *
++     * Submit and wait are split into separate calls so we can pass
++     * IORING_ENTER_NO_IOWAIT when the event loop is idle.
+      */
+     do {
+-        ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
++        ret = io_uring_submit(&ctx->fdmon_io_uring);
++
++        if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
++            unsigned flags = IORING_ENTER_GETEVENTS;
++
++            if (no_iowait) {
++                flags |= IORING_ENTER_NO_IOWAIT;
++            }
++
++            ret = io_uring_enter(ctx->fdmon_io_uring.ring_fd, 0, wait_nr,
++                                 flags, NULL);
++        }
+     } while (ret == -EINTR ||
+              (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
+ 
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..16f7da5 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -28,6 +28,7 @@ extra/0027-io-fix-cleanup-for-TLS-I-O-source-data-on-cancellati.patch
 extra/0028-io-fix-cleanup-for-websock-I-O-source-data-on-cancel.patch
 extra/0029-hw-Make-qdev_get_printable_name-consistently-return-.patch
 extra/0030-fuse-Copy-write-buffer-content-before-polling.patch
+extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
 bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
 bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
 bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
-- 
2.47.3