[PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
From: Thomas Lamprecht @ 2026-04-14 21:09 UTC
  To: pve-devel

Based on Jens Axboe's succinct reply [0] on the kernel side:
> It's not "IO pressure", it's the useless iowait metric...
> [...]
> If you won't want it, just turn it off with io_uring_set_iowait().

[0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

Might need some closer checking, but seems to work out OK.
Thanks @Fiona for some great off-list input w.r.t. this.
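
For the closer checking, one option is to sample the IO pressure value on
an otherwise idle host before and after applying the patch. Below is a
minimal sketch of that, assuming the usual PSI line format of
/proc/pressure/io ("some avg10=... avg60=... avg300=... total=..."). The
helper names psi_parse_avg10() and psi_read_some_avg10() are made up for
illustration and are not part of QEMU or this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the avg10 value from one PSI line, e.g.
 * "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
 * Returns 0 on success, -1 if the line is not of the requested kind.
 */
int psi_parse_avg10(const char *line, const char *kind, double *avg10)
{
    size_t n;

    assert(line && kind && avg10);
    n = strlen(kind);

    /* The line must start with the kind ("some" or "full") plus a space. */
    if (strncmp(line, kind, n) != 0 || line[n] != ' ') {
        return -1;
    }
    return sscanf(line + n, " avg10=%lf", avg10) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: read the "some" avg10 value from a PSI file such
 * as /proc/pressure/io. Returns 0 on success, -1 on any failure.
 */
int psi_read_some_avg10(const char *path, double *avg10)
{
    char line[256];
    int ret = -1;
    FILE *f = fopen(path, "r");

    if (!f) {
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (psi_parse_avg10(line, "some", avg10) == 0) {
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

Calling psi_read_some_avg10("/proc/pressure/io", &v) a few times with an
idle iothread running should then show whether the value still climbs
towards 100.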

If it works out I might also submit this upstream, but I would need to
recheck the current status upstream @qemu first (tbh. I'd be a bit
surprised if nobody else ran into this and got annoyed by it yet).
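
To make the idle condition easier to review in isolation, here is a rough
sketch of the flag selection from the patch below as a pure function. The
names idle_enter_flags() and idle_enter_flags_example() are hypothetical
and exist only for this illustration. The two NO_IOWAIT values are the
compat defines from the patch, and IORING_ENTER_GETEVENTS is defined here
only so the snippet builds without liburing.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback defines so the sketch builds without the io_uring headers */
#ifndef IORING_ENTER_GETEVENTS
#define IORING_ENTER_GETEVENTS (1U << 0)
#endif
#ifndef IORING_ENTER_NO_IOWAIT
#define IORING_ENTER_NO_IOWAIT (1U << 7)
#endif
#ifndef IORING_FEAT_NO_IOWAIT
#define IORING_FEAT_NO_IOWAIT  (1U << 17)
#endif

/*
 * Hypothetical helper: compute the io_uring_enter() flags for the wait.
 * The loop counts as idle when the only queued SQE is the timeout, or
 * when nothing is queued and there is no positive timeout.
 */
unsigned idle_enter_flags(uint32_t features, unsigned sq_ready,
                          int64_t timeout)
{
    unsigned flags = IORING_ENTER_GETEVENTS;
    unsigned idle_sqes = timeout > 0 ? 1 : 0;

    if ((features & IORING_FEAT_NO_IOWAIT) && sq_ready == idle_sqes) {
        flags |= IORING_ENTER_NO_IOWAIT;
    }
    return flags;
}

/* Usage example covering the relevant cases */
void idle_enter_flags_example(void)
{
    const unsigned idle = IORING_ENTER_GETEVENTS | IORING_ENTER_NO_IOWAIT;

    /* Only the timeout SQE is queued: idle, do not account IO wait. */
    assert(idle_enter_flags(IORING_FEAT_NO_IOWAIT, 1, 1000) == idle);
    /* Blocking forever with nothing queued: idle as well. */
    assert(idle_enter_flags(IORING_FEAT_NO_IOWAIT, 0, -1) == idle);
    /* Other SQEs are pending: keep the default accounting. */
    assert(idle_enter_flags(IORING_FEAT_NO_IOWAIT, 2, 1000) ==
           IORING_ENTER_GETEVENTS);
    /* Kernel without the feature bit: never pass the flag. */
    assert(idle_enter_flags(0, 1, 1000) == IORING_ENTER_GETEVENTS);
}
```

Keeping the decision free of ring state like this is only meant as a
review aid; the patch itself computes the same condition inline.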

 ...void-idle-event-loop-being-accounted.patch | 89 +++++++++++++++++++
 debian/patches/series                         |  1 +
 2 files changed, 90 insertions(+)
 create mode 100644 debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch

diff --git a/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
new file mode 100644
index 0000000..b23cd50
--- /dev/null
+++ b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
@@ -0,0 +1,89 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Thomas Lamprecht <t.lamprecht@proxmox.com>
+Date: Wed, 1 Apr 2026 21:08:28 +0200
+Subject: [PATCH] fdmon-io_uring: avoid idle event loop being accounted as IO
+ wait
+
+Since QEMU 10.2, iothreads use io_uring for file descriptor monitoring
+when available at build time. The blocking wait in io_uring_enter()
+gets accounted as IO wait by the kernel, causing nearly 100% IO
+pressure in PSI even when the system is idle.
+
+Split io_uring_submit_and_wait() into separate submit and wait calls
+so the IORING_ENTER_NO_IOWAIT flag can be passed to io_uring_enter().
+Only set the flag when the timeout is the sole pending SQE, i.e., the
+event loop is truly idle with no block layer IO or poll re-arms
+pending. The flag is gated on IORING_FEAT_NO_IOWAIT so older kernels
+are unaffected.
+
+Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
+---
+ util/fdmon-io_uring.c | 34 +++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
+
+diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
+index d0b56127c6..1d24c2dd5d 100644
+--- a/util/fdmon-io_uring.c
++++ b/util/fdmon-io_uring.c
+@@ -51,6 +51,15 @@
+ #include "aio-posix.h"
+ #include "trace.h"
+ 
++/* Compat defines for liburing versions that lack NO_IOWAIT support */
++#ifndef IORING_ENTER_NO_IOWAIT
++#define IORING_ENTER_NO_IOWAIT (1U << 7)
++#endif
++
++#ifndef IORING_FEAT_NO_IOWAIT
++#define IORING_FEAT_NO_IOWAIT  (1U << 17)
++#endif
++
+ enum {
+     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
+ 
+@@ -385,6 +394,7 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ {
+     struct __kernel_timespec ts;
+     unsigned wait_nr = 1; /* block until at least one cqe is ready */
++    bool no_iowait;
+     int ret;
+ 
+     if (timeout == 0) {
+@@ -405,14 +415,36 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ 
+     fill_sq_ring(ctx);
+ 
++    /*
++     * When only the timeout SQE is pending (or none at all), the event loop
++     * is idle and the blocking wait should not be accounted as IO wait in PSI.
++     */
++    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
++                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
++                    (timeout > 0 ? 1 : 0);
++
+     /*
+      * Loop to handle signals in both cases:
+      * 1. If no SQEs were submitted, then -EINTR is returned.
+      * 2. If SQEs were submitted then the number of SQEs submitted is returned
+      *    rather than -EINTR.
++     *
++     * Submit and wait are split into separate calls so we can pass
++     * IORING_ENTER_NO_IOWAIT when the event loop is idle.
+      */
+     do {
+-        ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
++        ret = io_uring_submit(&ctx->fdmon_io_uring);
++
++        if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
++            unsigned flags = IORING_ENTER_GETEVENTS;
++
++            if (no_iowait) {
++                flags |= IORING_ENTER_NO_IOWAIT;
++            }
++
++            ret = io_uring_enter(ctx->fdmon_io_uring.ring_fd, 0, wait_nr,
++                                 flags, NULL);
++        }
+     } while (ret == -EINTR ||
+              (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
+ 
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..16f7da5 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -28,6 +28,7 @@ extra/0027-io-fix-cleanup-for-TLS-I-O-source-data-on-cancellati.patch
 extra/0028-io-fix-cleanup-for-websock-I-O-source-data-on-cancel.patch
 extra/0029-hw-Make-qdev_get_printable_name-consistently-return-.patch
 extra/0030-fuse-Copy-write-buffer-content-before-polling.patch
+extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
 bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
 bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
 bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
-- 
2.47.3
