public inbox for pve-devel@lists.proxmox.com
* [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
@ 2026-04-14 21:09 Thomas Lamprecht
  0 siblings, 0 replies; only message in thread
From: Thomas Lamprecht @ 2026-04-14 21:09 UTC (permalink / raw)
  To: pve-devel

Based on Jens Axboe's succinct reply [0] on the kernel side:
> It's not "IO pressure", it's the useless iowait metric...
> [...]
> If you won't want it, just turn it off with io_uring_set_iowait().

[0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

Might need some closer checking, but seems to work out OK.
Thanks @Fiona for some great off-list input w.r.t. this.

If it works out I might also submit this upstream @qemu, but would need
to recheck the current status there first (tbh. I'd be a bit surprised
if nobody has run into this and gotten annoyed yet).

 ...void-idle-event-loop-being-accounted.patch | 89 +++++++++++++++++++
 debian/patches/series                         |  1 +
 2 files changed, 90 insertions(+)
 create mode 100644 debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch

diff --git a/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
new file mode 100644
index 0000000..b23cd50
--- /dev/null
+++ b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
@@ -0,0 +1,89 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Thomas Lamprecht <t.lamprecht@proxmox.com>
+Date: Wed, 1 Apr 2026 21:08:28 +0200
+Subject: [PATCH] fdmon-io_uring: avoid idle event loop being accounted as IO
+ wait
+
+Since QEMU 10.2, iothreads use io_uring for file descriptor monitoring
+when available at build time. The blocking wait in io_uring_enter()
+gets accounted as IO wait by the kernel, causing nearly 100% IO
+pressure in PSI even when the system is idle.
+
+Split io_uring_submit_and_wait() into separate submit and wait calls
+so the IORING_ENTER_NO_IOWAIT flag can be passed to io_uring_enter().
+Only set the flag when the timeout is the sole pending SQE, i.e., the
+event loop is truly idle with no block layer IO or poll re-arms
+pending. The flag is gated on IORING_FEAT_NO_IOWAIT so older kernels
+are unaffected.
+
+Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
+---
+ util/fdmon-io_uring.c | 34 +++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
+
+diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
+index d0b56127c6..1d24c2dd5d 100644
+--- a/util/fdmon-io_uring.c
++++ b/util/fdmon-io_uring.c
+@@ -51,6 +51,15 @@
+ #include "aio-posix.h"
+ #include "trace.h"
+ 
++/* Compat defines for liburing versions that lack NO_IOWAIT support */
++#ifndef IORING_ENTER_NO_IOWAIT
++#define IORING_ENTER_NO_IOWAIT (1U << 7)
++#endif
++
++#ifndef IORING_FEAT_NO_IOWAIT
++#define IORING_FEAT_NO_IOWAIT  (1U << 17)
++#endif
++
+ enum {
+     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
+ 
+@@ -385,6 +394,7 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ {
+     struct __kernel_timespec ts;
+     unsigned wait_nr = 1; /* block until at least one cqe is ready */
++    bool no_iowait;
+     int ret;
+ 
+     if (timeout == 0) {
+@@ -405,14 +415,36 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ 
+     fill_sq_ring(ctx);
+ 
++    /*
++     * When only the timeout SQE is pending (or none at all), the event loop
++     * is idle and the blocking wait should not be accounted as IO wait in PSI.
++     */
++    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
++                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
++                    (timeout > 0 ? 1 : 0);
++
+     /*
+      * Loop to handle signals in both cases:
+      * 1. If no SQEs were submitted, then -EINTR is returned.
+      * 2. If SQEs were submitted then the number of SQEs submitted is returned
+      *    rather than -EINTR.
++     *
++     * Submit and wait are split into separate calls so we can pass
++     * IORING_ENTER_NO_IOWAIT when the event loop is idle.
+      */
+     do {
+-        ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
++        ret = io_uring_submit(&ctx->fdmon_io_uring);
++
++        if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
++            unsigned flags = IORING_ENTER_GETEVENTS;
++
++            if (no_iowait) {
++                flags |= IORING_ENTER_NO_IOWAIT;
++            }
++
++            ret = io_uring_enter(ctx->fdmon_io_uring.ring_fd, 0, wait_nr,
++                                 flags, NULL);
++        }
+     } while (ret == -EINTR ||
+              (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
+ 
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..16f7da5 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -28,6 +28,7 @@ extra/0027-io-fix-cleanup-for-TLS-I-O-source-data-on-cancellati.patch
 extra/0028-io-fix-cleanup-for-websock-I-O-source-data-on-cancel.patch
 extra/0029-hw-Make-qdev_get_printable_name-consistently-return-.patch
 extra/0030-fuse-Copy-write-buffer-content-before-polling.patch
+extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
 bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
 bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
 bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
-- 
2.47.3
