* [pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
From: Fiona Ebner @ 2025-11-26 11:03 UTC
To: pve-devel

As reported in the community forum [0], backups of VMs with a TPM disk
on BTRFS storages would likely get stuck since the switch to
'-blockdev' in qemu-server, i.e. commit 439f6e2a ("backup: use blockdev
for TPM state file"), which moved away from using a 'drive_add' HMP
command to attach the TPM state drive. Before that commit, the aio mode
was QEMU's default 'threads'; since then, it is the PVE default
'io_uring'.
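
For illustration, with '-blockdev' the aio mode is selected explicitly
for the TPM state file on the QEMU command line, along the lines of
(node name and path are made up):

> -blockdev raw,node-name=drive-tpmstate0,file.driver=file,file.filename=/path/to/tpm-state.raw,file.aio=io_uring

whereas the old 'drive_add' invocation did not specify 'aio' and thus
used QEMU's 'threads' default.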

The issue is actually not BTRFS-specific, but a logic bug in QEMU's
current io_uring implementation; see the added patch for details. QEMU
10.2 includes a major rework of the io_uring code, so the issue is
already fixed there by QEMU commit 047dabef97 ("block/io_uring: use
aio_add_sqe()"). While still on 10.1, a different fix is needed.

The patch has been submitted upstream [1].
[0]: https://forum.proxmox.com/threads/170045/
[1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
debian/patches/series | 1 +
2 files changed, 154 insertions(+)
create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
new file mode 100644
index 0000000..372ecad
--- /dev/null
+++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
@@ -0,0 +1,153 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Fiona Ebner <f.ebner@proxmox.com>
+Date: Mon, 24 Nov 2025 15:28:27 +0100
+Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
+ resubmit at the end of ioq_submit()
+
+Note that this issue seems to be fixed already in current master as a
+consequence of the large io_uring rework with 047dabef97
+("block/io_uring: use aio_add_sqe()"), so this patch is purely for
+QEMU stable branches.
+
+At the end of ioq_submit(), there is an opportunistic call to
+luring_process_completions(). This is the single caller of
+luring_process_completions() that doesn't use the
+luring_process_completions_and_submit() wrapper.
+
+Other callers use the wrapper, because luring_process_completions()
+might require a subsequent call to ioq_submit() after resubmitting a
+request. As noted for luring_resubmit():
+
+> Resubmit a request by appending it to submit_queue. The caller must ensure
+> that ioq_submit() is called later so that submit_queue requests are started.
+
+So the caller at the end of ioq_submit() violates the contract and can
+in fact be problematic if no other requests come in later. In such a
+case, the request intended to be resubmitted will never actually be
+submitted via io_uring_submit().
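+
+For reference, the wrapper used by the other callers follows this
+pattern (a paraphrased sketch, not the verbatim code from
+block/io_uring.c):
+
+> static void luring_process_completions_and_submit(LuringState *s)
+> {
+>     luring_process_completions(s);
+>
+>     /* pick up requests that luring_resubmit() re-queued */
+>     if (s->io_q.in_queue > 0) {
+>         ioq_submit(s);
+>     }
+> }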
+
+A reproducer exposing this issue is [0], which is based on user
+reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
+
+I had the most success triggering the issue with [0] when using a
+BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
+but the issue eventually triggers on my machine as well. With iotest
+109 and '-i io_uring', the issue triggers reliably on my ext4 file
+system.
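+
+(For reference, the iotests' check script accepts the aio mode via its
+'-i' option, so the invocation is along the lines of
+
+> ./check -i io_uring 109
+
+typically run from the build tree's tests/qemu-iotests directory.)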
+
+Have ioq_submit() submit any resubmitted requests after calling
+luring_process_completions(). The return value from io_uring_submit()
+is checked to be non-negative before opportunistically processing
+completions and entering the new resubmit logic, to ensure that a
+failure of io_uring_submit() is not missed. Note that even before this
+change, the return value was not necessarily the total number of
+submissions, since the loop might have iterated more than once.
+
+Only trigger the resubmission logic when it is actually required, to
+avoid changing behavior more than necessary. For example, iotest 109
+would produce more 'mirror ready' events if resubmission always
+happened after luring_process_completions() at the end of ioq_submit().
+
+Note that iotest 109 still does not pass as is when run with '-i
+io_uring', because two offset values for BLOCK_JOB_COMPLETED events
+are zero instead of non-zero as in the expected output. The two
+affected test cases are expected failures and still fail; they just
+fail "faster". These test cases do not actually trigger the resubmit
+logic, so the reason seems to be a different ordering of requests and
+completions in the current aio=io_uring implementation versus
+aio=threads.
+
+[0]:
+
+> #!/bin/bash -e
+> #file=/mnt/btrfs/disk.raw
+> file=/tmp/disk.raw
+> filesize=256
+> readsize=512
+> rm -f $file
+> truncate -s $filesize $file
+> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
+> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
+> <<EOF
+> {"execute": "qmp_capabilities"}
+> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
+> {"execute": "quit"}
+> EOF
+
+[1]: https://forum.proxmox.com/threads/170045/
+
+Cc: qemu-stable@nongnu.org
+Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
+---
+ block/io_uring.c | 16 +++++++++++++---
+ 1 file changed, 13 insertions(+), 3 deletions(-)
+
+diff --git a/block/io_uring.c b/block/io_uring.c
+index dd4f304910..5dbafc8f7b 100644
+--- a/block/io_uring.c
++++ b/block/io_uring.c
+@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
+ * event loop. When there are no events left to complete the BH is being
+ * canceled.
+ *
++ * Returns whether ioq_submit() must be called again afterwards since requests
++ * were resubmitted via luring_resubmit().
+ */
+-static void luring_process_completions(LuringState *s)
++static bool luring_process_completions(LuringState *s)
+ {
+ struct io_uring_cqe *cqes;
+ int total_bytes;
++ bool resubmit = false;
+
+ defer_call_begin();
+
+@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
+ */
+ if (ret == -EINTR || ret == -EAGAIN) {
+ luring_resubmit(s, luringcb);
++ resubmit = true;
+ continue;
+ }
+ } else if (!luringcb->qiov) {
+@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
+ if (luringcb->is_read) {
+ if (ret > 0) {
+ luring_resubmit_short_read(s, luringcb, ret);
++ resubmit = true;
+ continue;
+ } else {
+ /* Pad with zeroes */
+@@ -224,6 +229,8 @@ end:
+ qemu_bh_cancel(s->completion_bh);
+
+ defer_call_end();
++
++ return resubmit;
+ }
+
+ static int ioq_submit(LuringState *s)
+@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
+ int ret = 0;
+ LuringAIOCB *luringcb, *luringcb_next;
+
++resubmit:
+ while (s->io_q.in_queue > 0) {
+ /*
+ * Try to fetch sqes from the ring for requests waiting in
+@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
+ }
+ s->io_q.blocked = (s->io_q.in_queue > 0);
+
+- if (s->io_q.in_flight) {
++ if (ret >= 0 && s->io_q.in_flight) {
+ /*
+ * We can try to complete something just right away if there are
+ * still requests in-flight.
+ */
+- luring_process_completions(s);
++ if (luring_process_completions(s)) {
++ goto resubmit;
++ }
+ }
+ return ret;
+ }
diff --git a/debian/patches/series b/debian/patches/series
index b1afcd4..83e7f6d 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
extra/0008-ui-vdagent-fix-windows-agent-regression.patch
extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
+extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
--
2.47.3