public inbox for pve-devel@lists.proxmox.com
* [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
@ 2026-04-14 21:09 Thomas Lamprecht
  2026-04-22 14:55 ` Fiona Ebner
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Lamprecht @ 2026-04-14 21:09 UTC (permalink / raw)
  To: pve-devel

Based on Jens Axboe's succinct reply [0] on the kernel side:
> It's not "IO pressure", it's the useless iowait metric...
> [...]
> If you won't want it, just turn it off with io_uring_set_iowait().

[0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

Might need some closer checking, but seems to work out OK.
Thanks @Fiona for some great off-list input w.r.t. this.

If it works out I might also submit this upstream, but would need to
recheck the current status upstream @qemu first (as, tbh., I'd be a bit
surprised if nobody ran into this and got annoyed yet).

 ...void-idle-event-loop-being-accounted.patch | 89 +++++++++++++++++++
 debian/patches/series                         |  1 +
 2 files changed, 90 insertions(+)
 create mode 100644 debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch

diff --git a/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
new file mode 100644
index 0000000..b23cd50
--- /dev/null
+++ b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
@@ -0,0 +1,89 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Thomas Lamprecht <t.lamprecht@proxmox.com>
+Date: Wed, 1 Apr 2026 21:08:28 +0200
+Subject: [PATCH] fdmon-io_uring: avoid idle event loop being accounted as IO
+ wait
+
+Since QEMU 10.2, iothreads use io_uring for file descriptor monitoring
+when available at build time. The blocking wait in io_uring_enter()
+gets accounted as IO wait by the kernel, causing nearly 100% IO
+pressure in PSI even when the system is idle.
+
+Split io_uring_submit_and_wait() into separate submit and wait calls
+so the IORING_ENTER_NO_IOWAIT flag can be passed to io_uring_enter().
+Only set the flag when the timeout is the sole pending SQE, i.e., the
+event loop is truly idle with no block layer IO or poll re-arms
+pending. The flag is gated on IORING_FEAT_NO_IOWAIT so older kernels
+are unaffected.
+
+Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
+---
+ util/fdmon-io_uring.c | 34 +++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
+
+diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
+index d0b56127c6..1d24c2dd5d 100644
+--- a/util/fdmon-io_uring.c
++++ b/util/fdmon-io_uring.c
+@@ -51,6 +51,15 @@
+ #include "aio-posix.h"
+ #include "trace.h"
+ 
++/* Compat defines for liburing versions that lack NO_IOWAIT support */
++#ifndef IORING_ENTER_NO_IOWAIT
++#define IORING_ENTER_NO_IOWAIT (1U << 7)
++#endif
++
++#ifndef IORING_FEAT_NO_IOWAIT
++#define IORING_FEAT_NO_IOWAIT  (1U << 17)
++#endif
++
+ enum {
+     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
+ 
+@@ -385,6 +394,7 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ {
+     struct __kernel_timespec ts;
+     unsigned wait_nr = 1; /* block until at least one cqe is ready */
++    bool no_iowait;
+     int ret;
+ 
+     if (timeout == 0) {
+@@ -405,14 +415,36 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
+ 
+     fill_sq_ring(ctx);
+ 
++    /*
++     * When only the timeout SQE is pending (or none at all), the event loop
++     * is idle and the blocking wait should not be accounted as IO wait in PSI.
++     */
++    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
++                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
++                    (timeout > 0 ? 1 : 0);
++
+     /*
+      * Loop to handle signals in both cases:
+      * 1. If no SQEs were submitted, then -EINTR is returned.
+      * 2. If SQEs were submitted then the number of SQEs submitted is returned
+      *    rather than -EINTR.
++     *
++     * Submit and wait are split into separate calls so we can pass
++     * IORING_ENTER_NO_IOWAIT when the event loop is idle.
+      */
+     do {
+-        ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
++        ret = io_uring_submit(&ctx->fdmon_io_uring);
++
++        if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
++            unsigned flags = IORING_ENTER_GETEVENTS;
++
++            if (no_iowait) {
++                flags |= IORING_ENTER_NO_IOWAIT;
++            }
++
++            ret = io_uring_enter(ctx->fdmon_io_uring.ring_fd, 0, wait_nr,
++                                 flags, NULL);
++        }
+     } while (ret == -EINTR ||
+              (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
+ 
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..16f7da5 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -28,6 +28,7 @@ extra/0027-io-fix-cleanup-for-TLS-I-O-source-data-on-cancellati.patch
 extra/0028-io-fix-cleanup-for-websock-I-O-source-data-on-cancel.patch
 extra/0029-hw-Make-qdev_get_printable_name-consistently-return-.patch
 extra/0030-fuse-Copy-write-buffer-content-before-polling.patch
+extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
 bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
 bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
 bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
-- 
2.47.3





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
  2026-04-14 21:09 [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait Thomas Lamprecht
@ 2026-04-22 14:55 ` Fiona Ebner
  2026-04-23 15:36   ` Fiona Ebner
  0 siblings, 1 reply; 5+ messages in thread
From: Fiona Ebner @ 2026-04-22 14:55 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
> Based on Jens Axboe's succinct reply [0] on the kernel side:
>> It's not "IO pressure", it's the useless iowait metric...
>> [...]
>> If you won't want it, just turn it off with io_uring_set_iowait().
> 
> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 
> 

During testing with an RBD storage (no krbd), I found that, compared to
pve-qemu-kvm=10.1.2-7, the IO wait is still far higher even with your
patch for VMs that are just idling around. And "more idle" still
correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.

Maybe there is a separate issue specific to the RBD block driver, but
I'm not sure why there would be. At least, there is no IO pressure when
setting the IORING_ENTER_NO_IOWAIT flag unconditionally every time. My
suggestions below would only lead to setting the flag less often
compared to the current patch, so they would not help here either. Will
need to investigate more.

> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
> ---
> 
> Might need some closer checking, but seems to work out OK.
> Thanks @Fiona for some great off-list input w.r.t. this.
> 
> If it works out I might also submit this upstream, but would need to
> recheck the current status upstream @qemu first (as, tbh., I'd be a bit
> surprised if nobody ran into this and got annoyed yet).
> 
>  ...void-idle-event-loop-being-accounted.patch | 89 +++++++++++++++++++
>  debian/patches/series                         |  1 +
>  2 files changed, 90 insertions(+)
>  create mode 100644 debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
> 
> diff --git a/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
> new file mode 100644
> index 0000000..b23cd50
> --- /dev/null
> +++ b/debian/patches/extra/0031-fdmon-io_uring-avoid-idle-event-loop-being-accounted.patch
> @@ -0,0 +1,89 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Thomas Lamprecht <t.lamprecht@proxmox.com>
> +Date: Wed, 1 Apr 2026 21:08:28 +0200
> +Subject: [PATCH] fdmon-io_uring: avoid idle event loop being accounted as IO
> + wait
> +
> +Since QEMU 10.2, iothreads use io_uring for file descriptor monitoring
> +when available at build time. The blocking wait in io_uring_enter()
> +gets accounted as IO wait by the kernel, causing nearly 100% IO
> +pressure in PSI even when the system is idle.
> +
> +Split io_uring_submit_and_wait() into separate submit and wait calls
> +so the IORING_ENTER_NO_IOWAIT flag can be passed to io_uring_enter().
> +Only set the flag when the timeout is the sole pending SQE, i.e., the

Would change it to: "... sole pending SQE or none at all, i.e. ..."

> +event loop is truly idle with no block layer IO or poll re-arms
> +pending. The flag is gated on IORING_FEAT_NO_IOWAIT so older kernels
> +are unaffected.
> +
> +Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
> +---
> + util/fdmon-io_uring.c | 34 +++++++++++++++++++++++++++++++++-
> + 1 file changed, 33 insertions(+), 1 deletion(-)
> +
> +diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> +index d0b56127c6..1d24c2dd5d 100644
> +--- a/util/fdmon-io_uring.c
> ++++ b/util/fdmon-io_uring.c
> +@@ -51,6 +51,15 @@
> + #include "aio-posix.h"
> + #include "trace.h"
> + 
> ++/* Compat defines for liburing versions that lack NO_IOWAIT support */
> ++#ifndef IORING_ENTER_NO_IOWAIT
> ++#define IORING_ENTER_NO_IOWAIT (1U << 7)
> ++#endif
> ++
> ++#ifndef IORING_FEAT_NO_IOWAIT
> ++#define IORING_FEAT_NO_IOWAIT  (1U << 17)
> ++#endif
> ++
> + enum {
> +     FDMON_IO_URING_ENTRIES  = 128, /* sq/cq ring size */
> + 
> +@@ -385,6 +394,7 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
> + {
> +     struct __kernel_timespec ts;
> +     unsigned wait_nr = 1; /* block until at least one cqe is ready */
> ++    bool no_iowait;
> +     int ret;
> + 
> +     if (timeout == 0) {
> +@@ -405,14 +415,36 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,

When the timeout SQE is prepared, get_sqe() is called, which calls
io_uring_submit() if the ring is full of pending requests. We don't
detect those requests anymore with the io_uring_sq_ready() call below.

> + 
> +     fill_sq_ring(ctx);
> + 
> ++    /*
> ++     * When only the timeout SQE is pending (or none at all), the event loop
> ++     * is idle and the blocking wait should not be accounted as IO wait in PSI.
> ++     */
> ++    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> ++                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
> ++                    (timeout > 0 ? 1 : 0);
> ++

While probably not too relevant in practice, there is a small race where
another thread could add an SQE after the check here and before submitting.

> +     /*
> +      * Loop to handle signals in both cases:
> +      * 1. If no SQEs were submitted, then -EINTR is returned.
> +      * 2. If SQEs were submitted then the number of SQEs submitted is returned
> +      *    rather than -EINTR.
> ++     *
> ++     * Submit and wait are split into separate calls so we can pass
> ++     * IORING_ENTER_NO_IOWAIT when the event loop is idle.
> +      */
> +     do {
> +-        ret = io_uring_submit_and_wait(&ctx->fdmon_io_uring, wait_nr);
> ++        ret = io_uring_submit(&ctx->fdmon_io_uring);

If an interrupt happens, the race window is even larger, since there is
now an additional call for submitting.

> ++
> ++        if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
> ++            unsigned flags = IORING_ENTER_GETEVENTS;
> ++
> ++            if (no_iowait) {
> ++                flags |= IORING_ENTER_NO_IOWAIT;
> ++            }
> ++
> ++            ret = io_uring_enter(ctx->fdmon_io_uring.ring_fd, 0, wait_nr,
> ++                                 flags, NULL);
> ++        }
> +     } while (ret == -EINTR ||
> +              (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)));
> + 

My proposal to avoid those issues:

> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 3427dbd1a9..1ed619bfe2 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -406,7 +406,13 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  {
>      struct __kernel_timespec ts;
>      unsigned wait_nr = 1; /* block until at least one cqe is ready */
> -    bool no_iowait;
> +    /*
> +     * Starts out as true if the NO_IOWAIT feature is supported. If a
> +     * non-timeout request is detected, it is set to false to enable accounting.
> +     */
> +    bool no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT);
> +    /* Whether there is a stand-alone timeout SQE or not. */
> +    bool standalone_timeout_sqe = false;
>      int ret;
>  
>      if (timeout == 0) {
> @@ -420,6 +426,9 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>              .tv_nsec = timeout % NANOSECONDS_PER_SECOND,
>          };
>  
> +        /* get_sqe() might call io_uring_submit(), so check early */
> +        standalone_timeout_sqe = io_uring_sq_ready(&ctx->fdmon_io_uring) == 0;
> +
>          sqe = get_sqe(ctx);
>          io_uring_prep_timeout(sqe, &ts, 1, 0);
>          io_uring_sqe_set_data(sqe, NULL);
> @@ -427,13 +436,14 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  
>      fill_sq_ring(ctx);
>  
> -    /*
> -     * When only the timeout SQE is pending (or none at all), the event loop
> -     * is idle and the blocking wait should not be accounted as IO wait in PSI.
> -     */
> -    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> -                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
> -                    (timeout > 0 ? 1 : 0);
> +    if (standalone_timeout_sqe) { /* Re-check if still stand-alone. */
> +        standalone_timeout_sqe = io_uring_sq_ready(&ctx->fdmon_io_uring) == 1;
> +    }
> +
> +    if (timeout > 0 && !standalone_timeout_sqe) {
> +        /* There is a timeout SQE, but not stand-alone. */
> +        no_iowait = false;
> +    }
>  
>      /*
>       * Loop to handle signals in both cases:
> @@ -447,6 +457,19 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>      do {
>          ret = io_uring_submit(&ctx->fdmon_io_uring);
>  
> +        /*
> +         * When only the timeout SQE was submitted (or none at all), the event
> +         * loop is idle and the blocking wait should not be accounted as IO wait
> +         * in PSI. If any other request is detected, enable IO wait accounting.
> +         */
> +        if (ret > (standalone_timeout_sqe ? 1 : 0)) {
> +            no_iowait = false;
> +        }
> +
> +        if (standalone_timeout_sqe && ret > 0) {
> +            standalone_timeout_sqe = false;
> +        }
> +
>          if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
>              unsigned flags = IORING_ENTER_GETEVENTS;





* Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
  2026-04-22 14:55 ` Fiona Ebner
@ 2026-04-23 15:36   ` Fiona Ebner
  2026-04-23 20:39     ` Thomas Lamprecht
  0 siblings, 1 reply; 5+ messages in thread
From: Fiona Ebner @ 2026-04-23 15:36 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On 22.04.26 at 4:54 PM, Fiona Ebner wrote:
> On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>> It's not "IO pressure", it's the useless iowait metric...
>>> [...]
>>> If you won't want it, just turn it off with io_uring_set_iowait().
>>
>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 
>>
> 
> During testing with an RBD storage (no krbd), I found that, compared to
> pve-qemu-kvm=10.1.2-7, the IO wait is still far higher even with your
> patch for VMs that are just idling around. And "more idle" still
> correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
> 
> Maybe there is a separate issue specific to the RBD block driver, but
> I'm not sure why there would be. At least, there is no IO pressure when
> setting the IORING_ENTER_NO_IOWAIT flag unconditionally every time. My
> suggestions below would only lead to setting the flag less often
> compared to the current patch, so they would not help here either. Will
> need to investigate more.

So while I haven't looked at what causes librbd to behave differently,
the difference is that there is an additional poll SQE that gets
re-armed at the end of process_cqe_aio_handler() via add_poll_add_sqe().
Reproducer [0].

So if we want to go with the current approach, we would also need to
detect whether that single poll SQE is the only one in the ring to be
submitted for the blocking wait. Which seems pretty messy, needing to
keep track of that between threads and adding detection for all
io_uring_submit() callers to see which one submitted it.

Alternative approach is to keep track of the number of in-flight
requests for which accounting should be done. Proposal below [1].

[0] Stand-alone reproducer with poll SQE:

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void) {
    int fd;
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    fd = open("testfile", O_RDWR | O_CREAT, 0644);
    assert(fd >= 0);
    unlink("testfile");
    ret = ftruncate(fd, 4096);
    assert(!ret);

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    // before submitting+advancing the issue does not happen
    // ret = io_uring_submit_and_wait(&ring, 1);
    // printf("got ret %d\n", ret);

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }

    io_uring_prep_nop(sqe);

    do {
        ret = io_uring_submit_and_wait(&ring, 1);
    } while (ret == -EINTR);

    if (ret != 1) {
        printf("Expected to submit one\n");
        return -1;
    }

    // using peek+seen has the same effect
    // struct io_uring_cqe* cqe;
    // io_uring_peek_cqe(&ring, &cqe);
    // io_uring_cqe_seen(&ring, cqe);
    io_uring_cq_advance(&ring, 1);

    io_uring_prep_poll_add(sqe, fd, 25);

    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);

    return 0;
}

[1] Patch proposal (on top of your patch, WIP):

> diff --git a/include/block/aio.h b/include/block/aio.h
> index 6049e6a0f4..0bdba5d17f 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -77,6 +77,8 @@ struct CqeHandler {
>  
>      /* This field is filled in before ->cb() is called */
>      struct io_uring_cqe cqe;
> +
> +    bool iowait_accounting;
>  };
>  
>  typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
> @@ -317,6 +319,12 @@ struct AioContext {
>  
>      /* Pending callback state for cqe handlers */
>      CqeHandlerSimpleQ cqe_handler_ready_list;
> +
> +    /*
> +     * Number of in-flight requests to be accounted for IO wait.
> +     * Must be accessed using atomics.
> +     */
> +    uint64_t iowait_accounting_reqs;
>  #endif /* CONFIG_LINUX_IO_URING */
>  
>      /* TimerLists for calling timers - one per clock type.  Has its own
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 3427dbd1a9..b02f8e4229 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -185,9 +185,14 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
>  {
>      struct io_uring_sqe *sqe = get_sqe(ctx);
>  
> +    /* Currently, IO wait should be accounted for all requests here. */
> +    cqe_handler->iowait_accounting = true;
> +
>      prep_sqe(sqe, opaque);
>      io_uring_sqe_set_data(sqe, cqe_handler);
>  
> +    qatomic_inc(&ctx->iowait_accounting_reqs);
> +
>      trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
>                                   cqe_handler);
>  }
> @@ -295,6 +300,10 @@ static bool process_cqe(AioContext *ctx,
>          return false;
>      }
>  
> +    if (cqe_handler->iowait_accounting) {
> +        qatomic_dec(&ctx->iowait_accounting_reqs);
> +    }
> +
>      /*
>       * Special handling for AioHandler cqes. They need ready_list and have a
>       * return value.
> @@ -406,7 +415,6 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  {
>      struct __kernel_timespec ts;
>      unsigned wait_nr = 1; /* block until at least one cqe is ready */
> -    bool no_iowait;
>      int ret;
>  
>      if (timeout == 0) {
> @@ -427,14 +435,6 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  
>      fill_sq_ring(ctx);
>  
> -    /*
> -     * When only the timeout SQE is pending (or none at all), the event loop
> -     * is idle and the blocking wait should not be accounted as IO wait in PSI.
> -     */
> -    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> -                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
> -                    (timeout > 0 ? 1 : 0);
> -
>      /*
>       * Loop to handle signals in both cases:
>       * 1. If no SQEs were submitted, then -EINTR is returned.
> @@ -450,7 +450,9 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>          if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
>              unsigned flags = IORING_ENTER_GETEVENTS;
>  
> -            if (no_iowait) {
> +            if ((ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> +                qatomic_read(&ctx->iowait_accounting_reqs) == 0)
> +            {
>                  flags |= IORING_ENTER_NO_IOWAIT;
>              }






* Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
  2026-04-23 15:36   ` Fiona Ebner
@ 2026-04-23 20:39     ` Thomas Lamprecht
  2026-04-24  9:13       ` Fiona Ebner
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Lamprecht @ 2026-04-23 20:39 UTC (permalink / raw)
  To: Fiona Ebner, pve-devel

On 23.04.26 at 17:34, Fiona Ebner wrote:
> On 22.04.26 at 4:54 PM, Fiona Ebner wrote:
>> On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
>>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>>> It's not "IO pressure", it's the useless iowait metric...
>>>> [...]
>>>> If you won't want it, just turn it off with io_uring_set_iowait().
>>>
>>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 
>>>
>>
>> During testing with an RBD storage (no krbd), I found that, compared to
>> pve-qemu-kvm=10.1.2-7, the IO wait is still far higher even with your
>> patch for VMs that are just idling around. And "more idle" still
>> correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
>> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
>>
>> Maybe there is a separate issue specific to the RBD block driver, but
>> I'm not sure why there would be. At least, there is no IO pressure when
>> setting the IORING_ENTER_NO_IOWAIT flag unconditionally every time. My
>> suggestions below would only lead to setting the flag less often
>> compared to the current patch, so they would not help here either. Will
>> need to investigate more.
> 
> So while I haven't looked at what causes librbd to behave differently,
> the difference is that there is an additional poll SQE that gets
> re-armed at the end of process_cqe_aio_handler() via add_poll_add_sqe().
> Reproducer [0].
> 
> So if we want to go with the current approach, we would also need to
> detect if that single poll SQE is the only one in the ring to be
> submitted for the blocking wait. Which seems pretty messy needing to
> keep track of that between threads, and adding detection for all
> io_uring_submit() callers to see which one submitted it.
> 
> Alternative approach is to keep track of the number of in-flight
> requests for which accounting should be done. Proposal below [1].

many thanks for the proposal!

>> diff --git a/include/block/aio.h b/include/block/aio.h
>> index 6049e6a0f4..0bdba5d17f 100644
>> --- a/include/block/aio.h
>> +++ b/include/block/aio.h
>> @@ -77,6 +77,8 @@ struct CqeHandler {
>>  
>>      /* This field is filled in before ->cb() is called */
>>      struct io_uring_cqe cqe;
>> +
>> +    bool iowait_accounting;
>>  };
>>  
>>  typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
>> @@ -317,6 +319,12 @@ struct AioContext {
>>  
>>      /* Pending callback state for cqe handlers */
>>      CqeHandlerSimpleQ cqe_handler_ready_list;
>> +
>> +    /*
>> +     * Number of in-flight requests to be accounted for IO wait.
>> +     * Must be accessed using atomics.

why though? AioContext is strictly single-threaded nowadays, and both
adding SQEs (enqueue) and process_cqe() (dequeue) run on the owning
thread. Or is this just defensive protection for potential future
changes? As it's cheap I'm fine with it, just wanted to know if I'm
overlooking something here.

The rest looks nice(r)! Will adapt to this.

>> +     */
>> +    uint64_t iowait_accounting_reqs;
>>  #endif /* CONFIG_LINUX_IO_URING */
>>  
>>      /* TimerLists for calling timers - one per clock type.  Has its own





* Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
  2026-04-23 20:39     ` Thomas Lamprecht
@ 2026-04-24  9:13       ` Fiona Ebner
  0 siblings, 0 replies; 5+ messages in thread
From: Fiona Ebner @ 2026-04-24  9:13 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On 23.04.26 at 10:37 PM, Thomas Lamprecht wrote:
> On 23.04.26 at 17:34, Fiona Ebner wrote:
>> On 22.04.26 at 4:54 PM, Fiona Ebner wrote:
>>> On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
>>>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>>>> It's not "IO pressure", it's the useless iowait metric...
>>>>> [...]
>>>>> If you won't want it, just turn it off with io_uring_set_iowait().
>>>>
>>>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 
>>>>
>>>
>>> During testing with an RBD storage (no krbd), I found that, compared to
>>> pve-qemu-kvm=10.1.2-7, the IO wait is still far higher even with your
>>> patch for VMs that are just idling around. And "more idle" still
>>> correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
>>> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
>>>
>>> Maybe there is a separate issue specific to the RBD block driver, but
>>> I'm not sure why there would be. At least, there is no IO pressure when
>>> setting the IORING_ENTER_NO_IOWAIT flag unconditionally every time. My
>>> suggestions below would only lead to setting the flag less often
>>> compared to the current patch, so they would not help here either. Will
>>> need to investigate more.
>>
>> So while I haven't looked at what causes librbd to behave differently,
>> the difference is that there is an additional poll SQE that gets
>> re-armed at the end of process_cqe_aio_handler() via add_poll_add_sqe().
>> Reproducer [0].
>>
>> So if we want to go with the current approach, we would also need to
>> detect if that single poll SQE is the only one in the ring to be
>> submitted for the blocking wait. Which seems pretty messy needing to
>> keep track of that between threads, and adding detection for all
>> io_uring_submit() callers to see which one submitted it.
>>
>> Alternative approach is to keep track of the number of in-flight
>> requests for which accounting should be done. Proposal below [1].
> 
> many thanks for the proposal!
> 
>>> diff --git a/include/block/aio.h b/include/block/aio.h
>>> index 6049e6a0f4..0bdba5d17f 100644
>>> --- a/include/block/aio.h
>>> +++ b/include/block/aio.h
>>> @@ -77,6 +77,8 @@ struct CqeHandler {
>>>  
>>>      /* This field is filled in before ->cb() is called */
>>>      struct io_uring_cqe cqe;
>>> +
>>> +    bool iowait_accounting;
>>>  };
>>>  
>>>  typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
>>> @@ -317,6 +319,12 @@ struct AioContext {
>>>  
>>>      /* Pending callback state for cqe handlers */
>>>      CqeHandlerSimpleQ cqe_handler_ready_list;
>>> +
>>> +    /*
>>> +     * Number of in-flight requests to be accounted for IO wait.
>>> +     * Must be accessed using atomics.
> 
> why though? AioContext is strictly single-threaded nowadays, and both
> adding SQEs (enqueue) and process_cqe() (dequeue) run on the owning
> thread. Or is this just defensive protection for potential future
> changes? As it's cheap I'm fine with it, just wanted to know if I'm
> overlooking something here.

No, looking over it again, I think you are right. I was under the
impression that somehow vCPUs might also add SQEs, but no, that already
happens in the iothread.

But my suggestion only applies to file/blockdev drivers, since others
(like librbd) do not actually call aio_add_sqe(); that is only called
via luring_co_submit(), which only happens in block/file-posix.c.




