public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes
@ 2020-10-22 12:11 Stefan Reiter
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw)
  To: pve-devel

Two smaller bugfixes for qmp_backup_cancel, which would otherwise lead to VM
hangs or wrongly aborted backups. Sent as separate patches to highlight the
changes, but they can probably be squashed into some of our other patches as
well (lmk if I should do that).

I also got dirty bitmap migration working, but it still needs some
testing/cleanup. I'd also like upstream to give their final decision on the fix
I sent, then I'll confidently send that as well :)

@Dominik: If you apply these two patches, can you still reproduce the hang on
abort? I'm not sure we experienced the 'same' hang.


Stefan Reiter (2):
  PVE: Don't expect complete_cb to be called outside coroutine
  PVE: Don't call job_cancel in coroutines

 pve-backup.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.20.1

* [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine
  2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter
@ 2020-10-22 12:11 ` Stefan Reiter
  2020-10-27 14:16   ` Wolfgang Bumiller
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter
  2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak
  2 siblings, 1 reply; 8+ messages in thread
From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw)
  To: pve-devel

We're at the mercy of the rest of QEMU here, and it sometimes decides to
call pvebackup_complete_cb from a coroutine. This really doesn't matter
to us, so don't assert and crash on it.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 pve-backup.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/pve-backup.c b/pve-backup.c
index 53cf23ed5a..9179754dcb 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque)
 
 static void pvebackup_complete_cb(void *opaque, int ret)
 {
-    assert(!qemu_in_coroutine());
-
     PVEBackupDevInfo *di = opaque;
     di->completed_ret = ret;
 
     /*
      * Schedule stream cleanup in async coroutine. close_image and finish might
-     * take a while, so we can't block on them here.
+     * take a while, so we can't block on them here. This way it also doesn't
+     * matter if we're already running in a coroutine or not.
      * Note: di is a pointer to an entry in the global backup_state struct, so
      * it stays valid.
      */
     Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di);
-    aio_co_schedule(qemu_get_aio_context(), co);
+    aio_co_enter(qemu_get_aio_context(), co);
 }
 
 static void coroutine_fn pvebackup_co_cancel(void *opaque)
-- 
2.20.1

* [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines
  2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter
@ 2020-10-22 12:11 ` Stefan Reiter
  2020-10-27 14:17   ` Wolfgang Bumiller
  2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak
  2 siblings, 1 reply; 8+ messages in thread
From: Stefan Reiter @ 2020-10-22 12:11 UTC (permalink / raw)
  To: pve-devel

...because it hangs on cancelling other jobs in the txn if you do.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 pve-backup.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/pve-backup.c b/pve-backup.c
index 9179754dcb..af2db0d4b9 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo {
     BlockJob *job;
 } PVEBackupDevInfo;
 
+typedef struct JobCancelData {
+    AioContext *ctx;
+    Coroutine *co;
+    Job *job;
+} JobCancelData;
+
 static void pvebackup_propagate_error(Error *err)
 {
     qemu_mutex_lock(&backup_state.stat.lock);
@@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret)
     aio_co_enter(qemu_get_aio_context(), co);
 }
 
+/*
+ * job_cancel(_sync) does not like to be called from coroutines, so defer to
+ * main loop processing via a bottom half.
+ */
+static void job_cancel_bh(void *opaque) {
+    JobCancelData *data = (JobCancelData*)opaque;
+    aio_context_acquire(data->job->aio_context);
+    job_cancel_sync(data->job);
+    aio_context_release(data->job->aio_context);
+    aio_co_schedule(data->ctx, data->co);
+}
+
 static void coroutine_fn pvebackup_co_cancel(void *opaque)
 {
     Error *cancel_err = NULL;
@@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque)
         NULL;
 
     if (cancel_job) {
-        job_cancel(&cancel_job->job, false);
+        JobCancelData data = {
+            .ctx = qemu_get_current_aio_context(),
+            .co = qemu_coroutine_self(),
+            .job = &cancel_job->job,
+        };
+        aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
+        qemu_coroutine_yield();
     }
 
     qemu_co_mutex_unlock(&backup_state.backup_mutex);
-- 
2.20.1

* Re: [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes
  2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter
@ 2020-10-22 13:07 ` Dominik Csapak
  2 siblings, 0 replies; 8+ messages in thread
From: Dominik Csapak @ 2020-10-22 13:07 UTC (permalink / raw)
  To: Stefan Reiter, pve-devel

No code review, as I am not very QEMU-coroutine savvy,
but I tested it and it solves my original problem.

Short summary of it:

Starting a backup that runs into a timeout and then trying to cancel
it resulted in a hanging QEMU process and an open backup task
(on the PBS side) that finished only when killing either end.

With these patches, that is solved and cancelling the backup works
as intended.

On 10/22/20 2:11 PM, Stefan Reiter wrote:
> Two smaller bugfixes for qmp_backup_cancel, which would otherwise lead to VM
> hangs or wrongly aborted backups. Sent as separate patches to highlight the
> changes, but they can probably be squashed into some of our other patches as
> well (lmk if I should do that).
> 
> I also got dirty bitmap migration working, but it still needs some
> testing/cleanup. I'd also like upstream to give their final decision on the fix
> I sent, then I'll confidently send that as well :)
> 
> @Dominik: If you apply these two patches, can you still reproduce the hang on
> abort? I'm not sure we experienced the 'same' hang.
> 
> 
> Stefan Reiter (2):
>    PVE: Don't expect complete_cb to be called outside coroutine
>    PVE: Don't call job_cancel in coroutines
> 
>   pve-backup.c | 33 ++++++++++++++++++++++++++++-----
>   1 file changed, 28 insertions(+), 5 deletions(-)
> 

* Re: [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter
@ 2020-10-27 14:16   ` Wolfgang Bumiller
  2020-10-27 14:57     ` Stefan Reiter
  0 siblings, 1 reply; 8+ messages in thread
From: Wolfgang Bumiller @ 2020-10-27 14:16 UTC (permalink / raw)
  To: Stefan Reiter; +Cc: pve-devel

On Thu, Oct 22, 2020 at 02:11:17PM +0200, Stefan Reiter wrote:
> We're at the mercy of the rest of QEMU here, and it sometimes decides to
> call pvebackup_complete_cb from a coroutine. This really doesn't matter
> to us, so don't assert and crash on it.
> 
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>  pve-backup.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/pve-backup.c b/pve-backup.c
> index 53cf23ed5a..9179754dcb 100644
> --- a/pve-backup.c
> +++ b/pve-backup.c
> @@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque)
>  
>  static void pvebackup_complete_cb(void *opaque, int ret)
>  {
> -    assert(!qemu_in_coroutine());
> -
>      PVEBackupDevInfo *di = opaque;
>      di->completed_ret = ret;
>  
>      /*
>       * Schedule stream cleanup in async coroutine. close_image and finish might
> -     * take a while, so we can't block on them here.
> +     * take a while, so we can't block on them here. This way it also doesn't
> +     * matter if we're already running in a coroutine or not.
>       * Note: di is a pointer to an entry in the global backup_state struct, so
>       * it stays valid.
>       */
>      Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di);
> -    aio_co_schedule(qemu_get_aio_context(), co);
> +    aio_co_enter(qemu_get_aio_context(), co);

Shouldn't this be decided based on `qemu_in_coroutine()`? Or are we
allowed to call enter regardless, I forgot...?

>  }
>  
>  static void coroutine_fn pvebackup_co_cancel(void *opaque)
> -- 
> 2.20.1

* Re: [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines
  2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter
@ 2020-10-27 14:17   ` Wolfgang Bumiller
  2020-10-27 14:57     ` Stefan Reiter
  0 siblings, 1 reply; 8+ messages in thread
From: Wolfgang Bumiller @ 2020-10-27 14:17 UTC (permalink / raw)
  To: Stefan Reiter; +Cc: pve-devel

On Thu, Oct 22, 2020 at 02:11:18PM +0200, Stefan Reiter wrote:
> ...because it hangs on cancelling other jobs in the txn if you do.
> 
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>  pve-backup.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/pve-backup.c b/pve-backup.c
> index 9179754dcb..af2db0d4b9 100644
> --- a/pve-backup.c
> +++ b/pve-backup.c
> @@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo {
>      BlockJob *job;
>  } PVEBackupDevInfo;
>  
> +typedef struct JobCancelData {
> +    AioContext *ctx;
> +    Coroutine *co;
> +    Job *job;
> +} JobCancelData;
> +
>  static void pvebackup_propagate_error(Error *err)
>  {
>      qemu_mutex_lock(&backup_state.stat.lock);
> @@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret)
>      aio_co_enter(qemu_get_aio_context(), co);
>  }
>  
> +/*
> + * job_cancel(_sync) does not like to be called from coroutines, so defer to
> + * main loop processing via a bottom half.
> + */
> +static void job_cancel_bh(void *opaque) {
> +    JobCancelData *data = (JobCancelData*)opaque;
> +    aio_context_acquire(data->job->aio_context);
> +    job_cancel_sync(data->job);
> +    aio_context_release(data->job->aio_context);
> +    aio_co_schedule(data->ctx, data->co);
> +}
> +
>  static void coroutine_fn pvebackup_co_cancel(void *opaque)
>  {
>      Error *cancel_err = NULL;
> @@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque)
>          NULL;
>  
>      if (cancel_job) {
> -        job_cancel(&cancel_job->job, false);
> +        JobCancelData data = {
> +            .ctx = qemu_get_current_aio_context(),
> +            .co = qemu_coroutine_self(),
> +            .job = &cancel_job->job,
> +        };
> +        aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
> +        qemu_coroutine_yield();

Don't we need some kind of synchronization here? The yield does not
guarantee we don't run before the bh is run, or does it? Maybe a condvar
to trigger the coro after the job cancel bh?

>      }
>  
>      qemu_co_mutex_unlock(&backup_state.backup_mutex);
> -- 
> 2.20.1

* Re: [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine
  2020-10-27 14:16   ` Wolfgang Bumiller
@ 2020-10-27 14:57     ` Stefan Reiter
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Reiter @ 2020-10-27 14:57 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pve-devel

On 10/27/20 3:16 PM, Wolfgang Bumiller wrote:
> On Thu, Oct 22, 2020 at 02:11:17PM +0200, Stefan Reiter wrote:
>> We're at the mercy of the rest of QEMU here, and it sometimes decides to
>> call pvebackup_complete_cb from a coroutine. This really doesn't matter
>> to us, so don't assert and crash on it.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>>   pve-backup.c | 7 +++----
>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/pve-backup.c b/pve-backup.c
>> index 53cf23ed5a..9179754dcb 100644
>> --- a/pve-backup.c
>> +++ b/pve-backup.c
>> @@ -318,19 +318,18 @@ static void coroutine_fn pvebackup_co_complete_stream(void *opaque)
>>   
>>   static void pvebackup_complete_cb(void *opaque, int ret)
>>   {
>> -    assert(!qemu_in_coroutine());
>> -
>>       PVEBackupDevInfo *di = opaque;
>>       di->completed_ret = ret;
>>   
>>       /*
>>        * Schedule stream cleanup in async coroutine. close_image and finish might
>> -     * take a while, so we can't block on them here.
>> +     * take a while, so we can't block on them here. This way it also doesn't
>> +     * matter if we're already running in a coroutine or not.
>>        * Note: di is a pointer to an entry in the global backup_state struct, so
>>        * it stays valid.
>>        */
>>       Coroutine *co = qemu_coroutine_create(pvebackup_co_complete_stream, di);
>> -    aio_co_schedule(qemu_get_aio_context(), co);
>> +    aio_co_enter(qemu_get_aio_context(), co);
> 
> Shouldn't this be decided based on `qemu_in_coroutine()`? Or are we
> allowed to call enter regardless, I forgot...?
> 

We are allowed to call it either way: if we're in a coroutine in the same
context this becomes a call; if not, it's scheduled correctly.
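
For reference, the dispatch in aio_co_enter() (util/async.c) looks roughly like
this (paraphrased from memory, so the wakeup-list details may differ slightly
from the current source):

void aio_co_enter(AioContext *ctx, Coroutine *co)
{
    if (ctx != qemu_get_current_aio_context()) {
        /* other context: hand it over, same as aio_co_schedule() */
        aio_co_schedule(ctx, co);
        return;
    }

    if (qemu_in_coroutine()) {
        /* same context, already in a coroutine: put it on our wakeup
         * list, it gets entered as soon as we yield */
        Coroutine *self = qemu_coroutine_self();
        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
    } else {
        /* same context, not in a coroutine: enter it directly */
        qemu_aio_coroutine_enter(ctx, co);
    }
}

Either way pvebackup_co_complete_stream ends up running in the main
AioContext, which is all we care about here.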

>>   }
>>   
>>   static void coroutine_fn pvebackup_co_cancel(void *opaque)
>> -- 
>> 2.20.1

* Re: [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines
  2020-10-27 14:17   ` Wolfgang Bumiller
@ 2020-10-27 14:57     ` Stefan Reiter
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Reiter @ 2020-10-27 14:57 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pve-devel

On 10/27/20 3:17 PM, Wolfgang Bumiller wrote:
> On Thu, Oct 22, 2020 at 02:11:18PM +0200, Stefan Reiter wrote:
>> ...because it hangs on cancelling other jobs in the txn if you do.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>>   pve-backup.c | 26 +++++++++++++++++++++++++-
>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/pve-backup.c b/pve-backup.c
>> index 9179754dcb..af2db0d4b9 100644
>> --- a/pve-backup.c
>> +++ b/pve-backup.c
>> @@ -82,6 +82,12 @@ typedef struct PVEBackupDevInfo {
>>       BlockJob *job;
>>   } PVEBackupDevInfo;
>>   
>> +typedef struct JobCancelData {
>> +    AioContext *ctx;
>> +    Coroutine *co;
>> +    Job *job;
>> +} JobCancelData;
>> +
>>   static void pvebackup_propagate_error(Error *err)
>>   {
>>       qemu_mutex_lock(&backup_state.stat.lock);
>> @@ -332,6 +338,18 @@ static void pvebackup_complete_cb(void *opaque, int ret)
>>       aio_co_enter(qemu_get_aio_context(), co);
>>   }
>>   
>> +/*
>> + * job_cancel(_sync) does not like to be called from coroutines, so defer to
>> + * main loop processing via a bottom half.
>> + */
>> +static void job_cancel_bh(void *opaque) {
>> +    JobCancelData *data = (JobCancelData*)opaque;
>> +    aio_context_acquire(data->job->aio_context);
>> +    job_cancel_sync(data->job);
>> +    aio_context_release(data->job->aio_context);
>> +    aio_co_schedule(data->ctx, data->co);
>> +}
>> +
>>   static void coroutine_fn pvebackup_co_cancel(void *opaque)
>>   {
>>       Error *cancel_err = NULL;
>> @@ -357,7 +375,13 @@ static void coroutine_fn pvebackup_co_cancel(void *opaque)
>>           NULL;
>>   
>>       if (cancel_job) {
>> -        job_cancel(&cancel_job->job, false);
>> +        JobCancelData data = {
>> +            .ctx = qemu_get_current_aio_context(),
>> +            .co = qemu_coroutine_self(),
>> +            .job = &cancel_job->job,
>> +        };
>> +        aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
>> +        qemu_coroutine_yield();
> 
> Don't we need some kind of synchronization here? The yield does not
> guarantee we don't run before the bh is run, or does it? Maybe a condvar
> to trigger the coro after the job cancel bh?
> 

No, it cannot race, since we execute the BH in the same context as the 
coroutine (qemu_get_current_aio_context()). The coroutine thus blocks 
execution of the BH until it yields.

See also code and comment in aio_co_reschedule_self() from 'util/async.c'.
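
To spell out the ordering with the names from the patch ('data' lives on the
coroutine's stack, so it stays valid for the whole dance):

    /* in pvebackup_co_cancel(), running in data.ctx */
    aio_bh_schedule_oneshot(data.ctx, job_cancel_bh, &data);
    /* the BH is only queued here; it runs from the event loop of data.ctx,
     * which is exactly the context this coroutine is currently occupying */

    qemu_coroutine_yield();
    /* only now does control return to the event loop, so only now can
     * job_cancel_bh() run; it does job_cancel_sync() and then
     * aio_co_schedule(data->ctx, data->co) to wake us up again */

    /* resumed here, after the job has been cancelled */

So the BH can neither run nor wake us before the yield.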

>>       }
>>   
>>       qemu_co_mutex_unlock(&backup_state.backup_mutex);
>> -- 
>> 2.20.1

end of thread, other threads:[~2020-10-27 14:57 UTC | newest]

Thread overview: 8+ messages
2020-10-22 12:11 [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Stefan Reiter
2020-10-22 12:11 ` [pve-devel] [PATCH qemu 1/2] PVE: Don't expect complete_cb to be called outside coroutine Stefan Reiter
2020-10-27 14:16   ` Wolfgang Bumiller
2020-10-27 14:57     ` Stefan Reiter
2020-10-22 12:11 ` [pve-devel] [PATCH qemu 2/2] PVE: Don't call job_cancel in coroutines Stefan Reiter
2020-10-27 14:17   ` Wolfgang Bumiller
2020-10-27 14:57     ` Stefan Reiter
2020-10-22 13:07 ` [pve-devel] [PATCH 0/2] QEMU backup cancellation fixes Dominik Csapak
