* [pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
From: Fiona Ebner @ 2025-11-26 11:03 UTC
To: pve-devel
As reported in the community forum [0], backups with a TPM disk on
BTRFS storage would likely get stuck since the switch to '-blockdev'
in qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM
state file"), which moved away from using a 'drive_add' HMP command to
attach the TPM state drive. Before that commit, the aio mode was
QEMU's default 'threads'; since then, it is the PVE default
'io_uring'.
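For illustration, a minimal sketch of attaching such a state file as a
blockdev with the PVE default aio mode (node name and file path are
hypothetical, not the exact qemu-server invocation); with the old
'drive_add' HMP path, the 'aio' option was simply left at QEMU's
'threads' default:

    {"execute": "blockdev-add", "arguments": {
        "driver": "raw", "node-name": "drive-tpmstate0-backup",
        "file": {"driver": "file", "filename": "/path/to/tpmstate0.raw",
                 "aio": "io_uring"}}}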
The issue is actually not BTRFS-specific, but rather a logic bug in
QEMU's current io_uring implementation; see the added patch for
details. QEMU 10.2 includes a major rework of the io_uring feature, so
the issue is already fixed there with QEMU commit 047dabef97
("block/io_uring: use aio_add_sqe()"). While still on 10.1, a
different fix is needed.
Upstream submission for the patch [1].
[0]: https://forum.proxmox.com/threads/170045/
[1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
debian/patches/series | 1 +
2 files changed, 154 insertions(+)
create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
new file mode 100644
index 0000000..372ecad
--- /dev/null
+++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
@@ -0,0 +1,153 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Fiona Ebner <f.ebner@proxmox.com>
+Date: Mon, 24 Nov 2025 15:28:27 +0100
+Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
+ resubmit at the end of ioq_submit()
+
+Note that this issue seems already fixed as a consequence of the large
+io_uring rework with 047dabef97 ("block/io_uring: use aio_add_sqe()")
+in current master, so this is purely for QEMU stable branches.
+
+At the end of ioq_submit(), there is an opportunistic call to
+luring_process_completions(). This is the single caller of
+luring_process_completions() that doesn't use the
+luring_process_completions_and_submit() wrapper.
+
+Other callers use the wrapper, because luring_process_completions()
+might require a subsequent call to ioq_submit() after resubmitting a
+request. As noted for luring_resubmit():
+
+> Resubmit a request by appending it to submit_queue. The caller must ensure
+> that ioq_submit() is called later so that submit_queue requests are started.
+
+So the caller at the end of ioq_submit() violates the contract and can
+in fact be problematic if no other requests come in later. In such a
+case, the request intended to be resubmitted will never actually be
+submitted via io_uring_submit().
+
+A reproducer exposing this issue is [0], which is based on user
+reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
+
+I had the most success triggering the issue with [0] when using a
+BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
+but it also triggers eventually on my machine. With iotest 109 and
+'-i io_uring', the issue triggers reliably on my ext4 file system.
+
+Have ioq_submit() submit any resubmitted requests after calling
+luring_process_completions(). The return value from io_uring_submit()
+is checked to be non-negative before opportunistically processing
+completions and entering the new resubmit logic, to ensure that a
+failure of io_uring_submit() is not missed. Also note that the return
+value was already not necessarily the total number of submissions,
+since the loop might have been iterated more than once even before
+the current change.
+
+Only trigger the resubmission logic when it is actually needed, to
+avoid changing behavior more than necessary. For example, iotest 109
+would produce more 'mirror ready' events if always resubmitting after
+luring_process_completions() at the end of ioq_submit().
+
+Note that iotest 109 still does not pass as-is when run with
+'-i io_uring', because two offset values for BLOCK_JOB_COMPLETED
+events are zero instead of non-zero as in the expected output. The
+two affected test cases are expected failures and still fail, so they
+just fail "faster". These test cases do not actually trigger the
+resubmit logic, so the reason seems to be a different ordering of
+requests and completions in the current aio=io_uring implementation
+versus aio=threads.
+
+[0]:
+
+> #!/bin/bash -e
+> #file=/mnt/btrfs/disk.raw
+> file=/tmp/disk.raw
+> filesize=256
+> readsize=512
+> rm -f $file
+> truncate -s $filesize $file
+> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
+> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
+> <<EOF
+> {"execute": "qmp_capabilities"}
+> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
+> {"execute": "quit"}
+> EOF
+
+[1]: https://forum.proxmox.com/threads/170045/
+
+Cc: qemu-stable@nongnu.org
+Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
+---
+ block/io_uring.c | 16 +++++++++++++---
+ 1 file changed, 13 insertions(+), 3 deletions(-)
+
+diff --git a/block/io_uring.c b/block/io_uring.c
+index dd4f304910..5dbafc8f7b 100644
+--- a/block/io_uring.c
++++ b/block/io_uring.c
+@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
+ * event loop. When there are no events left to complete the BH is being
+ * canceled.
+ *
++ * Returns whether ioq_submit() must be called again afterwards since requests
++ * were resubmitted via luring_resubmit().
+ */
+-static void luring_process_completions(LuringState *s)
++static bool luring_process_completions(LuringState *s)
+ {
+ struct io_uring_cqe *cqes;
+ int total_bytes;
++ bool resubmit = false;
+
+ defer_call_begin();
+
+@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
+ */
+ if (ret == -EINTR || ret == -EAGAIN) {
+ luring_resubmit(s, luringcb);
++ resubmit = true;
+ continue;
+ }
+ } else if (!luringcb->qiov) {
+@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
+ if (luringcb->is_read) {
+ if (ret > 0) {
+ luring_resubmit_short_read(s, luringcb, ret);
++ resubmit = true;
+ continue;
+ } else {
+ /* Pad with zeroes */
+@@ -224,6 +229,8 @@ end:
+ qemu_bh_cancel(s->completion_bh);
+
+ defer_call_end();
++
++ return resubmit;
+ }
+
+ static int ioq_submit(LuringState *s)
+@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
+ int ret = 0;
+ LuringAIOCB *luringcb, *luringcb_next;
+
++resubmit:
+ while (s->io_q.in_queue > 0) {
+ /*
+ * Try to fetch sqes from the ring for requests waiting in
+@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
+ }
+ s->io_q.blocked = (s->io_q.in_queue > 0);
+
+- if (s->io_q.in_flight) {
++ if (ret >= 0 && s->io_q.in_flight) {
+ /*
+ * We can try to complete something just right away if there are
+ * still requests in-flight.
+ */
+- luring_process_completions(s);
++ if (luring_process_completions(s)) {
++ goto resubmit;
++ }
+ }
+ return ret;
+ }
diff --git a/debian/patches/series b/debian/patches/series
index b1afcd4..83e7f6d 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
extra/0008-ui-vdagent-fix-windows-agent-regression.patch
extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
+extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
* Re: [pve-devel] [PATCH qemu] add fix for io_uring issue to unbreak backup with TPM disk potentially getting stuck
From: Fabian Grünbichler @ 2025-12-19 8:19 UTC
To: Proxmox VE development discussion; +Cc: Thomas Lamprecht
On November 26, 2025 12:03 pm, Fiona Ebner wrote:
> As reported in the community forum [0], backups with a TPM disk on
> BTRFS storage would likely get stuck since the switch to '-blockdev'
> in qemu-server, i.e. commit 439f6e2a ("backup: use blockdev for TPM
> state file"), which moved away from using a 'drive_add' HMP command to
> attach the TPM state drive. Before that commit, the aio mode was
> QEMU's default 'threads'; since then, it is the PVE default
> 'io_uring'.
>
> The issue is actually not BTRFS-specific, but rather a logic bug in
> QEMU's current io_uring implementation; see the added patch for
> details. QEMU 10.2 includes a major rework of the io_uring feature, so
> the issue is already fixed there with QEMU commit 047dabef97
> ("block/io_uring: use aio_add_sqe()"). While still on 10.1, a
> different fix is needed.
>
> Upstream submission for the patch [1].
>
> [0]: https://forum.proxmox.com/threads/170045/
> [1]: https://lists.nongnu.org/archive/html/qemu-stable/2025-11/msg00321.html
This has been reviewed and applied upstream for 10.1.3, so we should
either pull it in or update to 10.1.3 (I think 10.2 still had other issues?)
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> ...void-potentially-getting-stuck-after.patch | 153 ++++++++++++++++++
> debian/patches/series | 1 +
> 2 files changed, 154 insertions(+)
> create mode 100644 debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
>
> diff --git a/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> new file mode 100644
> index 0000000..372ecad
> --- /dev/null
> +++ b/debian/patches/extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> @@ -0,0 +1,153 @@
> +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
> +From: Fiona Ebner <f.ebner@proxmox.com>
> +Date: Mon, 24 Nov 2025 15:28:27 +0100
> +Subject: [PATCH] block/io_uring: avoid potentially getting stuck after
> + resubmit at the end of ioq_submit()
> +
> +Note that this issue seems already fixed as a consequence of the large
> +io_uring rework with 047dabef97 ("block/io_uring: use aio_add_sqe()")
> +in current master, so this is purely for QEMU stable branches.
> +
> +At the end of ioq_submit(), there is an opportunistic call to
> +luring_process_completions(). This is the single caller of
> +luring_process_completions() that doesn't use the
> +luring_process_completions_and_submit() wrapper.
> +
> +Other callers use the wrapper, because luring_process_completions()
> +might require a subsequent call to ioq_submit() after resubmitting a
> +request. As noted for luring_resubmit():
> +
> +> Resubmit a request by appending it to submit_queue. The caller must ensure
> +> that ioq_submit() is called later so that submit_queue requests are started.
> +
> +So the caller at the end of ioq_submit() violates the contract and can
> +in fact be problematic if no other requests come in later. In such a
> +case, the request intended to be resubmitted will never actually be
> +submitted via io_uring_submit().
> +
> +A reproducer exposing this issue is [0], which is based on user
> +reports from [1]. Another reproducer is iotest 109 with '-i io_uring'.
> +
> +I had the most success triggering the issue with [0] when using a
> +BTRFS RAID 1 storage. With tmpfs, it can take quite a few iterations,
> +but it also triggers eventually on my machine. With iotest 109 and
> +'-i io_uring', the issue triggers reliably on my ext4 file system.
> +
> +Have ioq_submit() submit any resubmitted requests after calling
> +luring_process_completions(). The return value from io_uring_submit()
> +is checked to be non-negative before opportunistically processing
> +completions and entering the new resubmit logic, to ensure that a
> +failure of io_uring_submit() is not missed. Also note that the return
> +value was already not necessarily the total number of submissions,
> +since the loop might have been iterated more than once even before
> +the current change.
> +
> +Only trigger the resubmission logic when it is actually needed, to
> +avoid changing behavior more than necessary. For example, iotest 109
> +would produce more 'mirror ready' events if always resubmitting after
> +luring_process_completions() at the end of ioq_submit().
> +
> +Note that iotest 109 still does not pass as-is when run with
> +'-i io_uring', because two offset values for BLOCK_JOB_COMPLETED
> +events are zero instead of non-zero as in the expected output. The
> +two affected test cases are expected failures and still fail, so they
> +just fail "faster". These test cases do not actually trigger the
> +resubmit logic, so the reason seems to be a different ordering of
> +requests and completions in the current aio=io_uring implementation
> +versus aio=threads.
> +
> +[0]:
> +
> +> #!/bin/bash -e
> +> #file=/mnt/btrfs/disk.raw
> +> file=/tmp/disk.raw
> +> filesize=256
> +> readsize=512
> +> rm -f $file
> +> truncate -s $filesize $file
> +> ./qemu-system-x86_64 --trace '*uring*' --qmp stdio \
> +> --blockdev raw,node-name=node0,file.driver=file,file.cache.direct=off,file.filename=$file,file.aio=io_uring \
> +> <<EOF
> +> {"execute": "qmp_capabilities"}
> +> {"execute": "human-monitor-command", "arguments": { "command-line": "qemu-io node0 \"read 0 $readsize \"" }}
> +> {"execute": "quit"}
> +> EOF
> +
> +[1]: https://forum.proxmox.com/threads/170045/
> +
> +Cc: qemu-stable@nongnu.org
> +Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> +---
> + block/io_uring.c | 16 +++++++++++++---
> + 1 file changed, 13 insertions(+), 3 deletions(-)
> +
> +diff --git a/block/io_uring.c b/block/io_uring.c
> +index dd4f304910..5dbafc8f7b 100644
> +--- a/block/io_uring.c
> ++++ b/block/io_uring.c
> +@@ -120,11 +120,14 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> + * event loop. When there are no events left to complete the BH is being
> + * canceled.
> + *
> ++ * Returns whether ioq_submit() must be called again afterwards since requests
> ++ * were resubmitted via luring_resubmit().
> + */
> +-static void luring_process_completions(LuringState *s)
> ++static bool luring_process_completions(LuringState *s)
> + {
> + struct io_uring_cqe *cqes;
> + int total_bytes;
> ++ bool resubmit = false;
> +
> + defer_call_begin();
> +
> +@@ -182,6 +185,7 @@ static void luring_process_completions(LuringState *s)
> + */
> + if (ret == -EINTR || ret == -EAGAIN) {
> + luring_resubmit(s, luringcb);
> ++ resubmit = true;
> + continue;
> + }
> + } else if (!luringcb->qiov) {
> +@@ -194,6 +198,7 @@ static void luring_process_completions(LuringState *s)
> + if (luringcb->is_read) {
> + if (ret > 0) {
> + luring_resubmit_short_read(s, luringcb, ret);
> ++ resubmit = true;
> + continue;
> + } else {
> + /* Pad with zeroes */
> +@@ -224,6 +229,8 @@ end:
> + qemu_bh_cancel(s->completion_bh);
> +
> + defer_call_end();
> ++
> ++ return resubmit;
> + }
> +
> + static int ioq_submit(LuringState *s)
> +@@ -231,6 +238,7 @@ static int ioq_submit(LuringState *s)
> + int ret = 0;
> + LuringAIOCB *luringcb, *luringcb_next;
> +
> ++resubmit:
> + while (s->io_q.in_queue > 0) {
> + /*
> + * Try to fetch sqes from the ring for requests waiting in
> +@@ -260,12 +268,14 @@ static int ioq_submit(LuringState *s)
> + }
> + s->io_q.blocked = (s->io_q.in_queue > 0);
> +
> +- if (s->io_q.in_flight) {
> ++ if (ret >= 0 && s->io_q.in_flight) {
> + /*
> + * We can try to complete something just right away if there are
> + * still requests in-flight.
> + */
> +- luring_process_completions(s);
> ++ if (luring_process_completions(s)) {
> ++ goto resubmit;
> ++ }
> + }
> + return ret;
> + }
> diff --git a/debian/patches/series b/debian/patches/series
> index b1afcd4..83e7f6d 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -8,6 +8,7 @@ extra/0007-vfio-only-check-region-info-cache-for-initial-region.patch
> extra/0008-ui-vdagent-fix-windows-agent-regression.patch
> extra/0009-file-posix-populate-pwrite_zeroes_alignment.patch
> extra/0010-block-use-pwrite_zeroes_alignment-when-writing-first.patch
> +extra/0011-block-io_uring-avoid-potentially-getting-stuck-after.patch
> bitmap-mirror/0001-drive-mirror-add-support-for-sync-bitmap-mode-never.patch
> bitmap-mirror/0002-drive-mirror-add-support-for-conditional-and-always-.patch
> bitmap-mirror/0003-mirror-add-check-for-bitmap-mode-without-bitmap.patch
> --
> 2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel