all lists on lists.proxmox.com
 help / color / mirror / Atom feed
From: Fabian Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC/PATCH qemu] PVE-Backup: avoid segfault issues upon backup-cancel
Date: Tue, 24 May 2022 13:30:50 +0200	[thread overview]
Message-ID: <20220524113050.179182-1-f.ebner@proxmox.com> (raw)

When canceling a backup in PVE via a signal it's easy to run into a
situation where the job is already failing when the backup_cancel QMP
command comes in. With a bit of unlucky timing on top, it can happen
that job_exit() runs between schedulung of job_cancel_bh() and
execution of job_cancel_bh(). But job_cancel_sync() does not expect
that the job is already finalized (in fact, the job might've been
freed already, but even if it isn't, job_cancel_sync() would try to
deref job->txn which would be NULL at that point).

It is not possible to simply use the job_cancel() (which is advertised
as being async but isn't in all cases) in qmp_backup_cancel() for the
same reason job_cancel_sync() cannot be used. Namely, because it can
invoke job_finish_sync() (which uses AIO_WAIT_WHILE and thus hangs if
called from a coroutine). This happens when there's multiple jobs in
the transaction and job->deferred_to_main_loop is true (is set before
scheduling job_exit()) or if the job was not started yet.

Fix the issue by selecting the job to cancel in job_cancel_bh() itself
using the first job that's not completed yet. This is not necessarily
the first job in the list, because pvebackup_co_complete_stream()
might not yet have removed a completed job when job_cancel_bh() runs.

An alternative would be to continue using only the first job and
checking against JOB_STATUS_CONCLUDED|JOB_STATUS_NULL to decide if
it's still necessary and possible to cancel, but the approach with
using the first non-completed job seemed more robust.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Intended to be ordered after
0038-PVE-Backup-Don-t-block-on-finishing-and-cleanup-crea.patch
or could also be squashed into that (while lifting the commit message
to the main repo). Of course I can also send that directly if this is
ACKed.

 pve-backup.c | 72 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/pve-backup.c b/pve-backup.c
index 6f05796fad..d0da6b63be 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -345,15 +345,45 @@ static void pvebackup_complete_cb(void *opaque, int ret)
 
 /*
  * job_cancel(_sync) does not like to be called from coroutines, so defer to
- * main loop processing via a bottom half.
+ * main loop processing via a bottom half. Assumes that caller holds
+ * backup_mutex and called job_ref on all jobs in backup_state.di_list.
  */
 static void job_cancel_bh(void *opaque) {
     CoCtxData *data = (CoCtxData*)opaque;
-    Job *job = (Job*)data->data;
-    AioContext *job_ctx = job->aio_context;
-    aio_context_acquire(job_ctx);
-    job_cancel_sync(job, true);
-    aio_context_release(job_ctx);
+
+    /*
+     * It's enough to cancel one job in the transaction, the rest will follow
+     * automatically.
+     */
+    bool canceled = false;
+
+    /*
+     * Be careful to pick a valid job to cancel:
+     * 1. job_cancel_sync() does not expect the job to be finalized already.
+     * 2. job_exit() might run between scheduling and running job_cancel_bh()
+     *    and pvebackup_co_complete_stream() might not have removed the job from
+     *    the list yet (in fact, cannot, because it waits for the backup_mutex).
+     * Requiring !job_is_completed() ensures that no finalized job is picked.
+     */
+    GList *bdi = g_list_first(backup_state.di_list);
+    while (bdi) {
+        if (bdi->data) {
+            BlockJob *bj = ((PVEBackupDevInfo *)bdi->data)->job;
+            if (bj) {
+                Job *job = &bj->job;
+                if (!canceled && !job_is_completed(job)) {
+                    AioContext *job_ctx = job->aio_context;
+                    aio_context_acquire(job_ctx);
+                    job_cancel_sync(job, true);
+                    aio_context_release(job_ctx);
+                    canceled = true;
+                }
+                job_unref(job);
+            }
+        }
+        bdi = g_list_next(bdi);
+    }
+
     aio_co_enter(data->ctx, data->co);
 }
 
@@ -374,23 +404,25 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque)
         proxmox_backup_abort(backup_state.pbs, "backup canceled");
     }
 
-    /* it's enough to cancel one job in the transaction, the rest will follow
-     * automatically */
     GList *bdi = g_list_first(backup_state.di_list);
-    BlockJob *cancel_job = bdi && bdi->data ?
-        ((PVEBackupDevInfo *)bdi->data)->job :
-        NULL;
-
-    if (cancel_job) {
-        CoCtxData data = {
-            .ctx = qemu_get_current_aio_context(),
-            .co = qemu_coroutine_self(),
-            .data = &cancel_job->job,
-        };
-        aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
-        qemu_coroutine_yield();
+    while (bdi) {
+        if (bdi->data) {
+            BlockJob *bj = ((PVEBackupDevInfo *)bdi->data)->job;
+            if (bj) {
+                /* ensure job is not freed before job_cancel_bh() runs */
+                job_ref(&bj->job);
+            }
+        }
+        bdi = g_list_next(bdi);
     }
 
+    CoCtxData data = {
+        .ctx = qemu_get_current_aio_context(),
+        .co = qemu_coroutine_self(),
+    };
+    aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
+    qemu_coroutine_yield();
+
     qemu_co_mutex_unlock(&backup_state.backup_mutex);
 }
 
-- 
2.30.2





             reply	other threads:[~2022-05-24 11:30 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-24 11:30 Fabian Ebner [this message]
2022-05-25  8:10 ` Fabian Ebner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220524113050.179182-1-f.ebner@proxmox.com \
    --to=f.ebner@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal