* [pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
From: Fiona Ebner @ 2025-11-26 11:03 UTC
To: pve-devel
As reported in the community forum [0], backups with a TPM disk on
BTRFS storage would likely get stuck since the switch to '-blockdev'
in qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM
state file"), which moved away from using a 'drive_add' HMP command to
attach the TPM state drive. Before that commit, the aio mode was
QEMU's default 'threads'; since then, it is the PVE default
'io_uring'.
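For illustration, a minimal sketch of attaching such a state file as a
blockdev with the PVE default aio mode (node name and file path are
hypothetical, not the exact qemu-server invocation); with the old
'drive_add' HMP path, the 'aio' option was simply left at QEMU's
'threads' default:

    {"execute": "blockdev-add", "arguments": {
        "driver": "raw", "node-name": "drive-tpmstate0-backup",
        "file": {"driver": "file", "filename": "/path/to/tpmstate0.raw",
                 "aio": "io_uring"}}}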
The issue is actually not BTRFS-specific, but rather a logic bug in
QEMU's current io_uring implementation; see the added patch for
details. QEMU 10.2 includes a major rework of the io_uring feature, so
the issue is already fixed there with QEMU commit 047dabef97
("block/io_uring: use aio_add_sqe()"). While still on 10.1, a
different fix is needed.
Upstream submission for the patch [1].
[0]: https://forum.proxmox.com/threads/170045/
[1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
debian/patches/series | 1 +
2 files changed, 154 insertions(+)
create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
new file mode 100644
index 0000000..372ecad
--- /dev/null
+++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
@@ -0,0 +1,153 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Fiona Ebner <f.ebner@proxmox.com>
+Date: Mon, 24 Nov 2025 15:28:27 +0100
+Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
+ resubmit at the end of ioq_submit()
+
+Note that this issue seems already fixed as a consequence of the large
+io_uring rework with 047dabef97 ("block/io_uring: use aio_add_sqe()")
+in current master, so this is purely for QEMU stable branches.
+
+At the end of ioq_submit(), there is an opportunistic call to
+luring_process_completions(). This is the single caller of
+luring_process_completions() that doesn't use the
+luring_process_completions_and_submit() wrapper.
+
+Other callers use the wrapper, because luring_process_completions()
+might require a subsequent call to ioq_submit() after resubmitting a
+request. As noted for luring_resubmit():
+
+> Resubmit a request by appending it to submit_queue. The caller must ensure
+> that ioq_submit() is called later so that submit_queue requests are started.
+
+So the caller at the end of ioq_submit() violates the contract and can
+in fact be problematic if no other requests come in later. In such a
+case, the request intended to be resubmitted will never actually be
+submitted via io_uring_submit().
+
+A reproducer exposing this issue is [0], which is based on user
+reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
+
+I had the most success triggering the issue with [0] when using a
+BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
+but it also triggers eventually on my machine. With iotest 109 and
+'-i io_uring', the issue triggers reliably on my ext4 file system.
+
+Have ioq_submit() submit any resubmitted requests after calling
+luring_process_completions(). The return value from io_uring_submit()
+is checked to be non-negative before opportunistically processing
+completions and entering the new resubmit logic, to ensure that a
+failure of io_uring_submit() is not missed. Also note that the return
+value was already not necessarily the total number of submissions,
+since the loop might have been iterated more than once even before
+the current change.
+
+Only trigger the resubmission logic when it is actually needed, to
+avoid changing behavior more than necessary. For example, iotest 109
+would produce more 'mirror ready' events if always resubmitting after
+luring_process_completions() at the end of ioq_submit().
+
+Note that iotest 109 still does not pass as-is when run with
+'-i io_uring', because two offset values for BLOCK_JOB_COMPLETED
+events are zero instead of non-zero as in the expected output. The
+two affected test cases are expected failures and still fail, so they
+just fail "faster". These test cases do not actually trigger the
+resubmit logic, so the reason seems to be a different ordering of
+requests and completions in the current aio=io_uring implementation
+versus aio=threads.
+
+[0]:
+
+> #!/bin/bash -e
+> #file=/mnt/btrfs/disk.raw
+> file=/tmp/disk.raw
+> filesize=256
+> readsize=512
+> rm -f $file
+> truncate -s $filesize $file
+> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
+> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
+> <<EOF
+> {"execute": "qmp_capabilities"}
+> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
+> {"execute": "quit"}
+> EOF
+
+[1]: https://forum.proxmox.com/threads/170045/
+
+Cc: qemu-stable@nongnu.org
+Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
+---
+ block/io_uring.c | 16 +++++++++++++---
+ 1 file changed, 13 insertions(+), 3 deletions(-)
+
+diff --git a/block/io_uring.c b/block/io_uring.c
+index dd4f304910..5dbafc8f7b 100644
+--- a/block/io_uring.c
++++ b/block/io_uring.c
+@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
+ * event loop. When there are no events left to complete the BH is being
+ * canceled.
+ *
++ * Returns whether ioq_submit() must be called again afterwards since requests
++ * were resubmitted via luring_resubmit().
+ */
+-static void luring_process_completions(LuringState *s)
++static bool luring_process_completions(LuringState *s)
+ {
+ struct io_uring_cqe *cqes;
+ int total_bytes;
++ bool resubmit = false;
+
+ defer_call_begin();
+
+@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
+ */
+ if (ret == -EINTR || ret == -EAGAIN) {
+ luring_resubmit(s, luringcb);
++ resubmit = true;
+ continue;
+ }
+ } else if (!luringcb->qiov) {
+@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
+ if (luringcb->is_read) {
+ if (ret > 0) {
+ luring_resubmit_short_read(s, luringcb, ret);
++ resubmit = true;
+ continue;
+ } else {
+ /* Pad with zeroes */
+@@ -224,6 +229,8 @@ end:
+ qemu_bh_cancel(s->completion_bh);
+
+ defer_call_end();
++
++ return resubmit;
+ }
+
+ static int ioq_submit(LuringState *s)
+@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
+ int ret = 0;
+ LuringAIOCB *luringcb, *luringcb_next;
+
++resubmit:
+ while (s->io_q.in_queue > 0) {
+ /*
+ * Try to fetch sqes from the ring for requests waiting in
+@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
+ }
+ s->io_q.blocked = (s->io_q.in_queue > 0);
+
+- if (s->io_q.in_flight) {
++ if (ret >= 0 && s->io_q.in_flight) {
+ /*
+ * We can try to complete something just right away if there are
+ * still requests in-flight.
+ */
+- luring_process_completions(s);
++ if (luring_process_completions(s)) {
++ goto resubmit;
++ }
+ }
+ return ret;
+ }
diff --git a/debian/patches/series b/debian/patches/series
index b1afcd4..83e7f6d 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
extra/0008-ui-vdagent-fix-windows-agent-regression.patch
extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
+extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
* Re: [pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
From: Fabian Grünbichler @ 2025-12-19 8:19 UTC
To: Proxmox VE development discussion; +Cc: Thomas Lamprecht
On November 26, 2025 12:03 pm, Fiona Ebner wrote:
> As reported in the community forum [0], backups with a TPM disk on
> BTRFS storage would likely get stuck since the switch to '-blockdev'
> in qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM
> state file"), which moved away from using a 'drive_add' HMP command to
> attach the TPM state drive. Before that commit, the aio mode was
> QEMU's default 'threads'; since then, it is the PVE default
> 'io_uring'.
>
> The issue is actually not BTRFS-specific, but rather a logic bug in
> QEMU's current io_uring implementation; see the added patch for
> details. QEMU 10.2 includes a major rework of the io_uring feature, so
> the issue is already fixed there with QEMU commit 047dabef97
> ("block/io_uring: use aio_add_sqe()"). While still on 10.1, a
> different fix is needed.
>
> Upstream submission for the patch [1].
>
> [0]: https://forum.proxmox.com/threads/170045/
> [1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
This has been reviewed and applied upstream for 10.1.3, so we should
either pull it in or update to 10.1.3 (I think 10.2 still had other issues?)
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> ...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
> debian/patches/series | 1 +
> 2 files changed, 154 insertions(+)
> create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
>
> diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> new file mode 100644
> index 0000000..372ecad
> --- /dev/null
> +++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> @@ -0,0 +1,153 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Fiona Ebner <f.ebner@proxmox.com>
> +Date: Mon, 24 Nov 2025 15:28:27 +0100
> +Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
> + resubmit at the end of ioq_submit()
> +
> +Note that this issue seems already fixed as a consequence of the large
> +io_uring rework with 047dabef97 ("block/io_uring: use aio_add_sqe()")
> +in current master, so this is purely for QEMU stable branches.
> +
> +At the end of ioq_submit(), there is an opportunistic call to
> +luring_process_completions(). This is the single caller of
> +luring_process_completions() that doesn't use the
> +luring_process_completions_and_submit() wrapper.
> +
> +Other callers use the wrapper, because luring_process_completions()
> +might require a subsequent call to ioq_submit() after resubmitting a
> +request. As noted for luring_resubmit():
> +
> +> Resubmit a request by appending it to submit_queue. The caller must ensure
> +> that ioq_submit() is called later so that submit_queue requests are started.
> +
> +So the caller at the end of ioq_submit() violates the contract and can
> +in fact be problematic if no other requests come in later. In such a
> +case, the request intended to be resubmitted will never actually be
> +submitted via io_uring_submit().
> +
> +A reproducer exposing this issue is [0], which is based on user
> +reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
> +
> +I had the most success triggering the issue with [0] when using a
> +BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
> +but it also triggers eventually on my machine. With iotest 109 and
> +'-i io_uring', the issue triggers reliably on my ext4 file system.
> +
> +Have ioq_submit() submit any resubmitted requests after calling
> +luring_process_completions(). The return value from io_uring_submit()
> +is checked to be non-negative before opportunistically processing
> +completions and entering the new resubmit logic, to ensure that a
> +failure of io_uring_submit() is not missed. Also note that the return
> +value was already not necessarily the total number of submissions,
> +since the loop might have been iterated more than once even before
> +the current change.
> +
> +Only trigger the resubmission logic when it is actually needed, to
> +avoid changing behavior more than necessary. For example, iotest 109
> +would produce more 'mirror ready' events if always resubmitting after
> +luring_process_completions() at the end of ioq_submit().
> +
> +Note that iotest 109 still does not pass as-is when run with
> +'-i io_uring', because two offset values for BLOCK_JOB_COMPLETED
> +events are zero instead of non-zero as in the expected output. The
> +two affected test cases are expected failures and still fail, so they
> +just fail "faster". These test cases do not actually trigger the
> +resubmit logic, so the reason seems to be a different ordering of
> +requests and completions in the current aio=io_uring implementation
> +versus aio=threads.
> +
> +[0]:
> +
> +> #!/bin/bash -e
> +> #file=/mnt/btrfs/disk.raw
> +> file=/tmp/disk.raw
> +> filesize=256
> +> readsize=512
> +> rm -f $file
> +> truncate -s $filesize $file
> +> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
> +> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
> +> <<EOF
> +> {"execute": "qmp_capabilities"}
> +> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
> +> {"execute": "quit"}
> +> EOF
> +
> +[1]: https://forum.proxmox.com/threads/170045/
> +
> +Cc: qemu-stable@nongnu.org
> +Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> +---
> + block/io_uring.c | 16 +++++++++++++---
> + 1 file changed, 13 insertions(+), 3 deletions(-)
> +
> +diff --git a/block/io_uring.c b/block/io_uring.c
> +index dd4f304910..5dbafc8f7b 100644
> +--- a/block/io_uring.c
> ++++ b/block/io_uring.c
> +@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> + * event loop. When there are no events left to complete the BH is being
> + * canceled.
> + *
> ++ * Returns whether ioq_submit() must be called again afterwards since requests
> ++ * were resubmitted via luring_resubmit().
> + */
> +-static void luring_process_completions(LuringState *s)
> ++static bool luring_process_completions(LuringState *s)
> + {
> + struct io_uring_cqe *cqes;
> + int total_bytes;
> ++ bool resubmit = false;
> +
> + defer_call_begin();
> +
> +@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
> + */
> + if (ret == -EINTR || ret == -EAGAIN) {
> + luring_resubmit(s, luringcb);
> ++ resubmit = true;
> + continue;
> + }
> + } else if (!luringcb->qiov) {
> +@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
> + if (luringcb->is_read) {
> + if (ret > 0) {
> + luring_resubmit_short_read(s, luringcb, ret);
> ++ resubmit = true;
> + continue;
> + } else {
> + /* Pad with zeroes */
> +@@ -224,6 +229,8 @@ end:
> + qemu_bh_cancel(s->completion_bh);
> +
> + defer_call_end();
> ++
> ++ return resubmit;
> + }
> +
> + static int ioq_submit(LuringState *s)
> +@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
> + int ret = 0;
> + LuringAIOCB *luringcb, *luringcb_next;
> +
> ++resubmit:
> + while (s->io_q.in_queue > 0) {
> + /*
> + * Try to fetch sqes from the ring for requests waiting in
> +@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
> + }
> + s->io_q.blocked = (s->io_q.in_queue > 0);
> +
> +- if (s->io_q.in_flight) {
> ++ if (ret >= 0 && s->io_q.in_flight) {
> + /*
> + * We can try to complete something just right away if there are
> + * still requests in-flight.
> + */
> +- luring_process_completions(s);
> ++ if (luring_process_completions(s)) {
> ++ goto resubmit;
> ++ }
> + }
> + return ret;
> + }
> diff --git a/debian/patches/series b/debian/patches/series
> index b1afcd4..83e7f6d 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
> extra/0008-ui-vdagent-fix-windows-agent-regression.patch
> extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
> extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
> +extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
> bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
> bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
> --
> 2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel