From: Fiona Ebner <f.ebner@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
Date: Fri, 24 Apr 2026 11:13:16 +0200
Message-ID: <525c4dad-6d04-41f0-8a21-9302b0c6baa4@proxmox.com>
In-Reply-To: <3689d164-e7d3-44c1-96b8-2b84b7342dd5@proxmox.com>

On 23.04.26 at 10:37 PM, Thomas Lamprecht wrote:
> On 23.04.26 at 17:34, Fiona Ebner wrote:
>> On 22.04.26 at 4:54 PM, Fiona Ebner wrote:
>>> On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
>>>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>>>> It's not "IO pressure", it's the useless iowait metric...
>>>>> [...]
>>>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>>>
>>>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/ 
>>>>
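For context, a minimal liburing sketch of what Jens suggests; this
assumes the io_uring_set_iowait() helper he mentions in [0] takes the
ring plus an enable flag and fails on kernels without support for the
feature:

    #include <stdbool.h>
    #include <stdio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /*
         * Opt out of iowait accounting for waits on this ring. On
         * kernels without the feature this should fail, leaving the
         * default accounting in effect.
         */
        if (io_uring_set_iowait(&ring, false) < 0)
            fprintf(stderr, "no-iowait not supported, keeping default\n");

        /* ... submit SQEs and wait for completions as usual ... */

        io_uring_queue_exit(&ring);
        return 0;
    }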
>>>
>>> During testing with an RBD storage (no krbd), I found that, compared
>>> to pve-qemu-kvm=10.1.2-7, the IO wait is still drastically higher even
>>> with your patch for VMs that are just idling, and "more idle" still
>>> correlates with "more IO pressure". I'm seeing 0.0% or 0.x% with
>>> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
>>>
>>> Maybe there is a separate issue specific to the RBD block driver,
>>> though I'm not sure why there would be. At least, there is no IO
>>> pressure when setting the IORING_ENTER_NO_IOWAIT flag unconditionally
>>> every time. My suggestions below would only lead to setting the flag
>>> less often compared to the current patch, so they would not help here
>>> either. I will need to investigate more.
>>
>> While I haven't yet looked into what causes librbd to behave
>> differently, the difference is an additional poll SQE that gets
>> re-armed at the end of process_cqe_aio_handler() via
>> add_poll_add_sqe(). Reproducer [0].
>>
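To illustrate the pattern described above, here is a hypothetical,
heavily simplified sketch of such a perpetually re-armed poll request
(not QEMU's actual fdmon-io_uring code; names and structure are
illustrative only):

    #include <poll.h>
    #include <liburing.h>

    /* Arm a POLL_ADD request for fd, so at least one SQE is always in
     * flight on the ring. */
    static void rearm_poll(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe) {
            io_uring_prep_poll_add(sqe, fd, POLLIN);
            io_uring_submit(ring);
        }
    }

    /* Completion handling that immediately re-arms, mirroring how
     * add_poll_add_sqe() is called at the end of
     * process_cqe_aio_handler(). The ring thus never runs dry, so any
     * heuristic based on "are SQEs pending?" always sees work. */
    static void handle_cqe(struct io_uring *ring,
                           struct io_uring_cqe *cqe, int fd)
    {
        /* ... dispatch the completion to its handler ... */
        io_uring_cqe_seen(ring, cqe);
        rearm_poll(ring, fd);
    }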
>> So if we want to go with the current approach, we would also need to
>> detect whether that single poll SQE is the only one in the ring to be
>> submitted for the blocking wait. That seems pretty messy: it requires
>> keeping track of this across threads and adding detection for all
>> io_uring_submit() callers to see which one submitted it.
>>
>> An alternative approach is to keep track of the number of in-flight
>> requests for which accounting should be done. Proposal below [1].
> 
> many thanks for the proposal!
> 
>>> diff --git a/include/block/aio.h b/include/block/aio.h
>>> index 6049e6a0f4..0bdba5d17f 100644
>>> --- a/include/block/aio.h
>>> +++ b/include/block/aio.h
>>> @@ -77,6 +77,8 @@ struct CqeHandler {
>>>  
>>>      /* This field is filled in before ->cb() is called */
>>>      struct io_uring_cqe cqe;
>>> +
>>> +    bool iowait_accounting;
>>>  };
>>>  
>>>  typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
>>> @@ -317,6 +319,12 @@ struct AioContext {
>>>  
>>>      /* Pending callback state for cqe handlers */
>>>      CqeHandlerSimpleQ cqe_handler_ready_list;
>>> +
>>> +    /*
>>> +     * Number of in-flight requests to be accounted for IO wait.
>>> +     * Must be accessed using atomics.
> 
> Why though? AioContext is strictly single-threaded nowadays, and both
> the enqueue and process_cqe (dequeue) paths run on the owning thread.
> Or is this just defensive protection against potential future changes?
> As it's cheap, I'm fine with it; I just wanted to know whether I'm
> overlooking something here.

No, looking over it again, I think you are right. I was under the
impression that somehow vCPUs might also add SQEs, but no, that happens
in the iothread already.

But note that my suggestion only applies to file/blockdev drivers:
others (like librbd) do not actually call aio_add_sqe(), which is only
reached via luring_co_submit(), and that in turn is only called in
block/file-posix.c.
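To make the shape of that counter-based proposal concrete, here is a
rough sketch assuming a plain (non-atomic) counter on the AioContext, as
discussed above; the field name iowait_in_flight and the helper names
are hypothetical, not the actual patch:

    /* Hypothetical AioContext field: number of in-flight requests that
     * should count towards IO wait. Accessed by the owning thread only. */

    static void iowait_req_submitted(AioContext *ctx, CqeHandler *cqe_handler)
    {
        if (cqe_handler->iowait_accounting) {
            ctx->iowait_in_flight++;
        }
    }

    static void iowait_req_completed(AioContext *ctx, CqeHandler *cqe_handler)
    {
        if (cqe_handler->iowait_accounting) {
            ctx->iowait_in_flight--;
        }
    }

    /* Before the blocking wait: allow iowait accounting only while at
     * least one accountable request is in flight; otherwise request
     * IORING_ENTER_NO_IOWAIT so an idle event loop is not accounted. */
    static bool iowait_accounting_wanted(AioContext *ctx)
    {
        return ctx->iowait_in_flight > 0;
    }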





Thread overview: 5+ messages
2026-04-14 21:09 [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait Thomas Lamprecht
2026-04-22 14:55 ` Fiona Ebner
2026-04-23 15:36   ` Fiona Ebner
2026-04-23 20:39     ` Thomas Lamprecht
2026-04-24  9:13       ` Fiona Ebner [this message]
