Subject: Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
From: Fiona Ebner <f.ebner@proxmox.com>
To: Thomas Lamprecht, pve-devel@lists.proxmox.com
Date: Thu, 23 Apr 2026 17:36:26 +0200
References: <20260414211600.4023940-1-t.lamprecht@proxmox.com>
List-Id: Proxmox VE development discussion

Am 22.04.26 um 4:54 PM schrieb Fiona Ebner:
> Am 14.04.26 um 11:15 PM schrieb Thomas Lamprecht:
>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>> It's not "IO pressure", it's the useless iowait metric...
>>> [...]
>>> If you won't want it, just turn it off with io_uring_set_iowait().
>>
>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/
>>
>
> During testing with an RBD storage (no krbd), I found that, compared to
> pve-qemu-kvm=10.1.2-7, the IO wait is still far higher even with your
> patch, for VMs that are just idling around. And "more idle" still
> correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
>
> Maybe there is a separate issue specific to the RBD block driver, but
> I'm not sure why there would be. At least, there is no IO pressure when
> setting the IORING_ENTER_NO_IOWAIT flag unconditionally every time. My
> suggestions below would only lead to setting the flag less often
> compared to the current patch, so they would not help here either. Will
> need to investigate more.

While I haven't looked at what causes librbd to behave differently, the
difference is that there is an additional poll SQE that gets re-armed at
the end of process_cqe_aio_handler() via add_poll_add_sqe(). Reproducer
[0].

So if we want to go with the current approach, we would also need to
detect whether that single poll SQE is the only one in the ring to be
submitted for the blocking wait. That seems pretty messy: it would
require keeping track of that between threads and adding detection for
all io_uring_submit() callers to see which one submitted it.

An alternative approach is to keep track of the number of in-flight
requests for which accounting should be done. Proposal below [1].
[0] Stand-alone reproducer with poll SQE (the include targets were eaten
by the mail rendering; restored to the headers the code actually uses):

#include <assert.h>
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd;
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    fd = open("testfile", O_RDWR | O_CREAT, 0644);
    assert(fd >= 0);
    unlink("testfile");
    ret = ftruncate(fd, 4096);
    assert(!ret);

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    // before submitting+advancing the issue does not happen
    // ret = io_uring_submit_and_wait(&ring, 1);
    // printf("got ret %d\n", ret);

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }
    io_uring_prep_nop(sqe);

    do {
        ret = io_uring_submit_and_wait(&ring, 1);
    } while (ret == -EINTR);
    if (ret != 1) {
        printf("Expected to submit one\n");
        return -1;
    }

    // using peek+seen has the same effect
    // struct io_uring_cqe *cqe;
    // io_uring_peek_cqe(&ring, &cqe);
    // io_uring_cqe_seen(&ring, cqe);
    io_uring_cq_advance(&ring, 1);

    io_uring_prep_poll_add(sqe, fd, 25);
    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);
    return 0;
}

[1] Patch proposal (on top of your patch, WIP):

> diff --git a/include/block/aio.h b/include/block/aio.h
> index 6049e6a0f4..0bdba5d17f 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -77,6 +77,8 @@ struct CqeHandler {
>  
>      /* This field is filled in before ->cb() is called */
>      struct io_uring_cqe cqe;
> +
> +    bool iowait_accounting;
>  };
>  
>  typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
> @@ -317,6 +319,12 @@ struct AioContext {
>  
>      /* Pending callback state for cqe handlers */
>      CqeHandlerSimpleQ cqe_handler_ready_list;
> +
> +    /*
> +     * Number of in-flight requests to be accounted for IO wait.
> +     * Must be accessed using atomics.
> +     */
> +    uint64_t iowait_accounting_reqs;
>  #endif /* CONFIG_LINUX_IO_URING */
>  
>      /* TimerLists for calling timers - one per clock type. Has its own
> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> index 3427dbd1a9..b02f8e4229 100644
> --- a/util/fdmon-io_uring.c
> +++ b/util/fdmon-io_uring.c
> @@ -185,9 +185,14 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
>  {
>      struct io_uring_sqe *sqe = get_sqe(ctx);
>  
> +    /* Currently, IO wait should be accounted for all requests here. */
> +    cqe_handler->iowait_accounting = true;
> +
>      prep_sqe(sqe, opaque);
>      io_uring_sqe_set_data(sqe, cqe_handler);
>  
> +    qatomic_inc(&ctx->iowait_accounting_reqs);
> +
>      trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
>                                   cqe_handler);
>  }
> @@ -295,6 +300,10 @@ static bool process_cqe(AioContext *ctx,
>          return false;
>      }
>  
> +    if (cqe_handler->iowait_accounting) {
> +        qatomic_dec(&ctx->iowait_accounting_reqs);
> +    }
> +
>      /*
>       * Special handling for AioHandler cqes. They need ready_list and have a
>       * return value.
> @@ -406,7 +415,6 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  {
>      struct __kernel_timespec ts;
>      unsigned wait_nr = 1; /* block until at least one cqe is ready */
> -    bool no_iowait;
>      int ret;
>  
>      if (timeout == 0) {
> @@ -427,14 +435,6 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>  
>      fill_sq_ring(ctx);
>  
> -    /*
> -     * When only the timeout SQE is pending (or none at all), the event loop
> -     * is idle and the blocking wait should not be accounted as IO wait in PSI.
> -     */
> -    no_iowait = (ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> -                io_uring_sq_ready(&ctx->fdmon_io_uring) ==
> -                (timeout > 0 ? 1 : 0);
> -
>      /*
>       * Loop to handle signals in both cases:
>       * 1. If no SQEs were submitted, then -EINTR is returned.
> @@ -450,7 +450,9 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
>      if (ret >= 0 && wait_nr > io_uring_cq_ready(&ctx->fdmon_io_uring)) {
>          unsigned flags = IORING_ENTER_GETEVENTS;
>  
> -        if (no_iowait) {
> +        if ((ctx->fdmon_io_uring.features & IORING_FEAT_NO_IOWAIT) &&
> +            qatomic_read(&ctx->iowait_accounting_reqs) == 0)
> +        {
>              flags |= IORING_ENTER_NO_IOWAIT;
>          }