From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <525c4dad-6d04-41f0-8a21-9302b0c6baa4@proxmox.com>
Date: Fri, 24 Apr 2026 11:13:16 +0200
Subject: Re: [PATCH qemu] fdmon-io_uring: avoid idle event loop being accounted as IO wait
From: Fiona Ebner <f.ebner@proxmox.com>
To: Thomas Lamprecht, pve-devel@lists.proxmox.com
References: <20260414211600.4023940-1-t.lamprecht@proxmox.com> <3689d164-e7d3-44c1-96b8-2b84b7342dd5@proxmox.com>
In-Reply-To: <3689d164-e7d3-44c1-96b8-2b84b7342dd5@proxmox.com>
List-Id: Proxmox VE development discussion

On 23.04.26 at 10:37 PM, Thomas Lamprecht wrote:
> On 23.04.26 at 17:34, Fiona Ebner wrote:
>> On 22.04.26 at 4:54 PM, Fiona Ebner wrote:
>>> On 14.04.26 at 11:15 PM, Thomas Lamprecht wrote:
>>>> Based on Jens Axboe's succinct reply [0] on the kernel side:
>>>>> It's not "IO pressure", it's the useless iowait metric...
>>>>> [...]
>>>>> If you won't want it, just turn it off with io_uring_set_iowait().
>>>>
>>>> [0]: https://lore.kernel.org/all/49a977f3-45da-41dd-9fd6-75fd6760a591@kernel.dk/
>>>>
>>>
>>> During testing with an RBD storage (no krbd), I found that, compared
>>> to pve-qemu-kvm=10.1.2-7, IO wait is still much higher even with your
>>> patch for VMs that are just idling, and "more idle" still correlates
>>> with "more IO pressure". I'm seeing 0.0% or 0.x% with
>>> pve-qemu-kvm=10.1.2-7 and 40-100% even with the patch.
>>>
>>> Maybe there is a separate issue specific to the RBD block driver,
>>> though I am not sure why there would be. At least, there is no IO
>>> pressure when setting the IORING_ENTER_NO_IOWAIT flag unconditionally
>>> every time. My suggestions below would only lead to setting the flag
>>> less often compared to the current patch, so they would not help here
>>> either. Will need to investigate more.
>>
>> While I haven't looked into what causes librbd to behave differently,
>> the difference is that there is an additional poll SQE that gets
>> re-armed at the end of process_cqe_aio_handler() via
>> add_poll_add_sqe(). Reproducer [0].
>>
>> So if we want to go with the current approach, we would also need to
>> detect whether that single poll SQE is the only one in the ring to be
>> submitted for the blocking wait. That seems pretty messy, as it
>> requires keeping track of the SQE between threads and adding detection
>> for all io_uring_submit() callers to see which one submitted it.
>>
>> An alternative approach is to keep track of the number of in-flight
>> requests for which accounting should be done. Proposal below [1].
>
> many thanks for the proposal!
>
>>> diff --git a/include/block/aio.h b/include/block/aio.h
>>> index 6049e6a0f4..0bdba5d17f 100644
>>> --- a/include/block/aio.h
>>> +++ b/include/block/aio.h
>>> @@ -77,6 +77,8 @@ struct CqeHandler {
>>>
>>>     /* This field is filled in before ->cb() is called */
>>>     struct io_uring_cqe cqe;
>>> +
>>> +   bool iowait_accounting;
>>> };
>>>
>>> typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
>>> @@ -317,6 +319,12 @@ struct AioContext {
>>>
>>>     /* Pending callback state for cqe handlers */
>>>     CqeHandlerSimpleQ cqe_handler_ready_list;
>>> +
>>> +   /*
>>> +    * Number of in-flight requests to be accounted for IO wait.
>>> +    * Must be accessed using atomics.
>
> why though? AioContext is strictly single-threaded nowadays, and
> enqueue and process_cqe (dequeue) both run on the owning thread. Or is
> this just defensive protection for potential future changes? As it's
> cheap, I'm fine with it either way, just wanted to know whether I'm
> overlooking something here.

No, looking over it again, I think you are right. I was under the
impression that vCPU threads might somehow also add SQEs, but no, that
already happens in the iothread.

Note, though, that my suggestion only applies to file/blockdev drivers,
since others (like librbd) do not actually call aio_add_sqe(); that only
happens via luring_co_submit(), which in turn is only called from
block/file-posix.c.