* [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes @ 2020-10-22 12:11 Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter ` (2 more replies) 0 siblings, 3 replies; 8+ messages in thread From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw) To: pve-devel Two smaller bugfixes for qmp_backup_cancel, that would lead to VM hangs or wrongly aborted backups. Sent as seperate patches to highlight the changes, but can probably be squashed into some of our other patches as well (lmk if I should do that). I also got dirty bitmap migrate working, but still needs some testing/cleanup. I'd also like for upstream to give their final decision on the fix I sent, then I'll confidently send that as well :) @Dominik: If you apply these two patches, can you still reproduce the hang on abort? I'm not sure we experienced the 'same' hang. Stefan Reiter (2): PVE: Don't expect complete_cb to be called outside coroutine PVE: Don't call job_cancel in coroutines pve-backup.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine 2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter @ 2020-10-22 12:11 ` Stefan Reiter 2020-10-27 14:16 ` Wolfgang Bumiller 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter 2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak 2 siblings, 1 reply; 8+ messages in thread From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw) To: pve-devel We're at the mercy of the rest of QEMU here, and it sometimes decides to call pvebackup_complete_cb from a coroutine. This really doesn't matter to us, so don't assert and crash on it. Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> --- pve-backup.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/pve-backup.c b/pve-backup.c index 53cf23ed5a..9179754dcb 100644 --- a/pve-backup.c +++ b/pve-backup.c @@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque) static void pvebackup_complete_cb(void *opaque, int ret) { - assert(!qemu_in_coroutine()); - PVEBackupDevInfo *di = opaque; di->completed_ret = ret; /* * Schedule stream cleanup in async coroutine. close_image and finish might - * take a while, so we can't block on them here. + * take a while, so we can't block on them here. This way it also doesn't + * matter if we're already running in a coroutine or not. * Note: di is a pointer to an entry in the global backup_state struct, so * it stays valid. */ Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di); - aio_co_schedule(qemu_get_aio_context(), co); + aio_co_enter(qemu_get_aio_context(), co); } static void coroutine_fn pvebackup_co_cancel(void *opaque) -- 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter @ 2020-10-27 14:16 ` Wolfgang Bumiller 2020-10-27 14:57 ` Stefan Reiter 0 siblings, 1 reply; 8+ messages in thread From: Wolfgang Bumiller @ 2020-10-27 14:16 UTC (permalink / raw) To: Stefan Reiter; +Cc: pve-devel On Thu, Oct 22, 2020 at 02:11:17PM +0200, Stefan Reiter wrote: > We're at the mercy of the rest of QEMU here, and it sometimes decides to > call pvebackup_complete_cb from a coroutine. This really doesn't matter > to us, so don't assert and crash on it. > > Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> > --- > pve-backup.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/pve-backup.c b/pve-backup.c > index 53cf23ed5a..9179754dcb 100644 > --- a/pve-backup.c > +++ b/pve-backup.c > @@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque) > > static void pvebackup_complete_cb(void *opaque, int ret) > { > - assert(!qemu_in_coroutine()); > - > PVEBackupDevInfo *di = opaque; > di->completed_ret = ret; > > /* > * Schedule stream cleanup in async coroutine. close_image and finish might > - * take a while, so we can't block on them here. > + * take a while, so we can't block on them here. This way it also doesn't > + * matter if we're already running in a coroutine or not. > * Note: di is a pointer to an entry in the global backup_state struct, so > * it stays valid. > */ > Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di); > - aio_co_schedule(qemu_get_aio_context(), co); > + aio_co_enter(qemu_get_aio_context(), co); Shouldn't this be decided based on `qemu_in_coroutine()`? Or are we allowed to call enter regardless, I forgot...? > } > > static void coroutine_fn pvebackup_co_cancel(void *opaque) > -- > 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine 2020-10-27 14:16 ` Wolfgang Bumiller @ 2020-10-27 14:57 ` Stefan Reiter 0 siblings, 0 replies; 8+ messages in thread From: Stefan Reiter @ 2020-10-27 14:57 UTC (permalink / raw) To: Wolfgang Bumiller; +Cc: pve-devel On 10/27/20 3:16 PM, Wolfgang Bumiller wrote: > On Thu, Oct 22, 2020 at 02:11:17PM +0200, Stefan Reiter wrote: >> We're at the mercy of the rest of QEMU here, and it sometimes decides to >> call pvebackup_complete_cb from a coroutine. This really doesn't matter >> to us, so don't assert and crash on it. >> >> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> >> --- >> pve-backup.c | 7 +++---- >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/pve-backup.c b/pve-backup.c >> index 53cf23ed5a..9179754dcb 100644 >> --- a/pve-backup.c >> +++ b/pve-backup.c >> @@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque) >> >> static void pvebackup_complete_cb(void *opaque, int ret) >> { >> - assert(!qemu_in_coroutine()); >> - >> PVEBackupDevInfo *di = opaque; >> di->completed_ret = ret; >> >> /* >> * Schedule stream cleanup in async coroutine. close_image and finish might >> - * take a while, so we can't block on them here. >> + * take a while, so we can't block on them here. This way it also doesn't >> + * matter if we're already running in a coroutine or not. >> * Note: di is a pointer to an entry in the global backup_state struct, so >> * it stays valid. >> */ >> Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di); >> - aio_co_schedule(qemu_get_aio_context(), co); >> + aio_co_enter(qemu_get_aio_context(), co); > > Shouldn't this be decided based on `qemu_in_coroutine()`? Or are we > allowed to call enter regardless, I forgot...? > We are allowed to call whenever, if we're in a coroutine in the same context this becomes a call, if not it's scheduled correctly. >> } >> >> static void coroutine_fn pvebackup_co_cancel(void *opaque) >> -- >> 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines 2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter @ 2020-10-22 12:11 ` Stefan Reiter 2020-10-27 14:17 ` Wolfgang Bumiller 2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak 2 siblings, 1 reply; 8+ messages in thread From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw) To: pve-devel ...because it hangs on cancelling other jobs in the txn if you do. Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> --- pve-backup.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/pve-backup.c b/pve-backup.c index 9179754dcb..af2db0d4b9 100644 --- a/pve-backup.c +++ b/pve-backup.c @@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo { BlockJob *job; } PVEBackupDevInfo; +typedef struct JobCancelData { + AioContext *ctx; + Coroutine *co; + Job *job; +} JobCancelData; + static void pvebackup_propagate_error(Error *err) { qemu_mutex_lock(&backup_state.stat.lock); @@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret) aio_co_enter(qemu_get_aio_context(), co); } +/* + * job_cancel(_sync) does not like to be called from coroutines, so defer to + * main loop processing via a bottom half. + */ +static void job_cancel_bh(void *opaque) { + JobCancelData *data = (JobCancelData*)opaque; + aio_context_acquire(data->job->aio_context); + job_cancel_sync(data->job); + aio_context_release(data->job->aio_context); + aio_co_schedule(data->ctx, data->co); +} + static void coroutine_fn pvebackup_co_cancel(void *opaque) { Error *cancel_err = NULL; @@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque) NULL; if (cancel_job) { - job_cancel(&cancel_job->job, false); + JobCancelData data = { + .ctx = qemu_get_current_aio_context(), + .co = qemu_coroutine_self(), + .job = &cancel_job->job, + }; + aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data); + qemu_coroutine_yield(); } qemu_co_mutex_unlock(&backup_state.backup_mutex); -- 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter @ 2020-10-27 14:17 ` Wolfgang Bumiller 2020-10-27 14:57 ` Stefan Reiter 0 siblings, 1 reply; 8+ messages in thread From: Wolfgang Bumiller @ 2020-10-27 14:17 UTC (permalink / raw) To: Stefan Reiter; +Cc: pve-devel On Thu, Oct 22, 2020 at 02:11:18PM +0200, Stefan Reiter wrote: > ...because it hangs on cancelling other jobs in the txn if you do. > > Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> > --- > pve-backup.c | 26 +++++++++++++++++++++++++- > 1 file changed, 25 insertions(+), 1 deletion(-) > > diff --git a/pve-backup.c b/pve-backup.c > index 9179754dcb..af2db0d4b9 100644 > --- a/pve-backup.c > +++ b/pve-backup.c > @@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo { > BlockJob *job; > } PVEBackupDevInfo; > > +typedef struct JobCancelData { > + AioContext *ctx; > + Coroutine *co; > + Job *job; > +} JobCancelData; > + > static void pvebackup_propagate_error(Error *err) > { > qemu_mutex_lock(&backup_state.stat.lock); > @@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret) > aio_co_enter(qemu_get_aio_context(), co); > } > > +/* > + * job_cancel(_sync) does not like to be called from coroutines, so defer to > + * main loop processing via a bottom half. > + */ > +static void job_cancel_bh(void *opaque) { > + JobCancelData *data = (JobCancelData*)opaque; > + aio_context_acquire(data->job->aio_context); > + job_cancel_sync(data->job); > + aio_context_release(data->job->aio_context); > + aio_co_schedule(data->ctx, data->co); > +} > + > static void coroutine_fn pvebackup_co_cancel(void *opaque) > { > Error *cancel_err = NULL; > @@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque) > NULL; > > if (cancel_job) { > - job_cancel(&cancel_job->job, false); > + JobCancelData data = { > + .ctx = qemu_get_current_aio_context(), > + .co = qemu_coroutine_self(), > + .job = &cancel_job->job, > + }; > + aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data); > + qemu_coroutine_yield(); Don't we need some kind of synchronization here? The yield does not guarantee we don't run before the bh is run, or does it? Maybe a condvar to trigger the coro after the job cancel bh? > } > > qemu_co_mutex_unlock(&backup_state.backup_mutex); > -- > 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines 2020-10-27 14:17 ` Wolfgang Bumiller @ 2020-10-27 14:57 ` Stefan Reiter 0 siblings, 0 replies; 8+ messages in thread From: Stefan Reiter @ 2020-10-27 14:57 UTC (permalink / raw) To: Wolfgang Bumiller; +Cc: pve-devel On 10/27/20 3:17 PM, Wolfgang Bumiller wrote: > On Thu, Oct 22, 2020 at 02:11:18PM +0200, Stefan Reiter wrote: >> ...because it hangs on cancelling other jobs in the txn if you do. >> >> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com> >> --- >> pve-backup.c | 26 +++++++++++++++++++++++++- >> 1 file changed, 25 insertions(+), 1 deletion(-) >> >> diff --git a/pve-backup.c b/pve-backup.c >> index 9179754dcb..af2db0d4b9 100644 >> --- a/pve-backup.c >> +++ b/pve-backup.c >> @@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo { >> BlockJob *job; >> } PVEBackupDevInfo; >> >> +typedef struct JobCancelData { >> + AioContext *ctx; >> + Coroutine *co; >> + Job *job; >> +} JobCancelData; >> + >> static void pvebackup_propagate_error(Error *err) >> { >> qemu_mutex_lock(&backup_state.stat.lock); >> @@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret) >> aio_co_enter(qemu_get_aio_context(), co); >> } >> >> +/* >> + * job_cancel(_sync) does not like to be called from coroutines, so defer to >> + * main loop processing via a bottom half. >> + */ >> +static void job_cancel_bh(void *opaque) { >> + JobCancelData *data = (JobCancelData*)opaque; >> + aio_context_acquire(data->job->aio_context); >> + job_cancel_sync(data->job); >> + aio_context_release(data->job->aio_context); >> + aio_co_schedule(data->ctx, data->co); >> +} >> + >> static void coroutine_fn pvebackup_co_cancel(void *opaque) >> { >> Error *cancel_err = NULL; >> @@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque) >> NULL; >> >> if (cancel_job) { >> - job_cancel(&cancel_job->job, false); >> + JobCancelData data = { >> + .ctx = qemu_get_current_aio_context(), >> + .co = qemu_coroutine_self(), >> + .job = &cancel_job->job, >> + }; >> + aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data); >> + qemu_coroutine_yield(); > > Don't we need some kind of synchronization here? The yield does not > guarantee we don't run before the bh is run, or does it? Maybe a condvar > to trigger the coro after the job cancel bh? > No, it cannot race, since we execute the BH in the same context as the coroutine (qemu_get_current_aio_context()). The coroutine thus blocks execution of the BH until it yields. See also code and comment in aio_co_reschedule_self() from 'util/async.c'. >> } >> >> qemu_co_mutex_unlock(&backup_state.backup_mutex); >> -- >> 2.20.1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes 2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter @ 2020-10-22 13:07 ` Dominik Csapak 2 siblings, 0 replies; 8+ messages in thread From: Dominik Csapak @ 2020-10-22 13:07 UTC (permalink / raw) To: Stefan Reiter, pve-devel no code review, as i am not very qemu-coroutine savvy but i tested it and it solves my original problem short summary of it: starting a backup that runs into a timeout and then trying to cancel it resulted in a hanging qemu process and open backup task (on pbs side) that finished only when killing either end with these patches, that is solved and canceling the backup works as intended On 10/22/20 2:11 PM, Stefan Reiter wrote: > Two smaller bugfixes for qmp_backup_cancel, that would lead to VM hangs or > wrongly aborted backups. Sent as seperate patches to highlight the changes, but > can probably be squashed into some of our other patches as well (lmk if I should > do that). > > I also got dirty bitmap migrate working, but still needs some testing/cleanup. > I'd also like for upstream to give their final decision on the fix I sent, then > I'll confidently send that as well :) > > @Dominik: If you apply these two patches, can you still reproduce the hang on > abort? I'm not sure we experienced the 'same' hang. > > > Stefan Reiter (2): > PVE: Don't expect complete_cb to be called outside coroutine > PVE: Don't call job_cancel in coroutines > > pve-backup.c | 33 ++++++++++++++++++++++++++++----- > 1 file changed, 28 insertions(+), 5 deletions(-) > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2020-10-27 14:57 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter 2020-10-27 14:16 ` Wolfgang Bumiller 2020-10-27 14:57 ` Stefan Reiter 2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter 2020-10-27 14:17 ` Wolfgang Bumiller 2020-10-27 14:57 ` Stefan Reiter 2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox