* [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
@ 2026-02-10 11:15 Dominik Csapak
2026-02-12 20:33 ` Benjamin McGuire
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Dominik Csapak @ 2026-02-10 11:15 UTC (permalink / raw)
To: pve-devel
When qmeventd detects a VM exiting, it starts 'qm cleanup' to clean up
files, execute hookscripts, etc.
Since the VM process exit is sometimes not instant, wait up to 30
seconds here for the process to exit before starting the cleanup,
instead of immediately aborting if the pid still exists. Aborting here
prevented the hookscript from being executed in the 'post-stop' phase.
This can easily be reproduced by e.g. passing through a USB device,
which delays the QEMU process exit for a few seconds.
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
changes from v1:
* use correct while condition (time() is always >= $starttime)
original comment:
The 30-second timeout was arbitrarily chosen, but we could probably
start with something smaller, like 10 seconds? Could be adapted when
applying, though.
In my (short) tests the USB passthrough part only adds a single second,
but I can imagine different devices on other systems could block it for
much longer.
src/PVE/CLI/qm.pm | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
index bdae9641..16875ed2 100755
--- a/src/PVE/CLI/qm.pm
+++ b/src/PVE/CLI/qm.pm
@@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
60,
sub {
my $conf = PVE::QemuConfig->load_config($vmid);
+
+ # wait for some timeout until vm process exits, since this might not be instant
+ my $timeout = 30;
+ my $starttime = time();
my $pid = PVE::QemuServer::check_running($vmid);
- die "vm still running\n" if $pid;
+ warn "vm still running - waiting up to $timeout seconds\n" if $pid;
+
+ while ($pid && (time() - $starttime) < $timeout) {
+ sleep(1);
+ $pid = PVE::QemuServer::check_running($vmid);
+ }
+
+ die "vm still running - aborting cleanup\n" if $pid;
# Rollback already does cleanup when preparing and afterwards temporarily drops the
# lock on the configuration file to rollback the volumes. Deactivating volumes here
--
2.47.3
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-10 11:15 [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds Dominik Csapak
@ 2026-02-12 20:33 ` Benjamin McGuire
2026-02-13 11:40 ` Fabian Grünbichler
2026-02-13 12:14 ` Fiona Ebner
2 siblings, 0 replies; 14+ messages in thread
From: Benjamin McGuire @ 2026-02-12 20:33 UTC (permalink / raw)
To: Dominik Csapak; +Cc: pve-devel
Tested-by: Benjamin McGuire <jaminmc@gmail.com>
> On Feb 10, 2026, at 6:15 AM, Dominik Csapak <d.csapak@proxmox.com> wrote:
>
> When qmeventd detects a vm exiting, it starts 'qm cleanup' to cleanup
> files, executing hookscripts, etc.
>
> Since the vm process exits is sometimes not instant, wait up to 30
> seconds here to start the cleanup process instead of immediately
> aborting if the pid still exits. This prevented executing the hookscript
> on the 'post-stop' phase.
>
> This can be easily reproduced by e.g. passing through a usb device,
> which delays the qemu process exit for a few seconds.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> changes from v1:
> * use correct while condition (time() is always >= $starttime)
>
> original comment:
>
> The 30 second timeout was arbitrarily chosen, but we could probably
> start with something smaller, like 10 seconds? Could be adapted on
> applying though.
>
> In my (short) tests the usb passthrough part only adds a single second,
> but i can imagine different devices on other systems could block it for
> much longer.
>
> src/PVE/CLI/qm.pm | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
> index bdae9641..16875ed2 100755
> --- a/src/PVE/CLI/qm.pm
> +++ b/src/PVE/CLI/qm.pm
> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
> 60,
> sub {
> my $conf = PVE::QemuConfig->load_config($vmid);
> +
> + # wait for some timeout until vm process exits, since this might not be instant
> + my $timeout = 30;
> + my $starttime = time();
> my $pid = PVE::QemuServer::check_running($vmid);
> - die "vm still running\n" if $pid;
> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
> +
> + while ($pid && (time() - $starttime) < $timeout) {
> + sleep(1);
> + $pid = PVE::QemuServer::check_running($vmid);
> + }
> +
> + die "vm still running - aborting cleanup\n" if $pid;
>
> # Rollback already does cleanup when preparing and afterwards temporarily drops the
> # lock on the configuration file to rollback the volumes. Deactivating volumes here
> --
> 2.47.3
>
>
>
>
>
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-10 11:15 [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds Dominik Csapak
2026-02-12 20:33 ` Benjamin McGuire
@ 2026-02-13 11:40 ` Fabian Grünbichler
2026-02-13 12:14 ` Fiona Ebner
2 siblings, 0 replies; 14+ messages in thread
From: Fabian Grünbichler @ 2026-02-13 11:40 UTC (permalink / raw)
To: Dominik Csapak, pve-devel
On February 10, 2026 12:15 pm, Dominik Csapak wrote:
> When qmeventd detects a vm exiting, it starts 'qm cleanup' to cleanup
> files, executing hookscripts, etc.
>
> Since the vm process exits is sometimes not instant, wait up to 30
> seconds here to start the cleanup process instead of immediately
> aborting if the pid still exits. This prevented executing the hookscript
> on the 'post-stop' phase.
>
> This can be easily reproduced by e.g. passing through a usb device,
> which delays the qemu process exit for a few seconds.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> changes from v1:
> * use correct while condition (time() is always >= $starttime)
>
> original comment:
>
> The 30 second timeout was arbitrarily chosen, but we could probably
> start with something smaller, like 10 seconds? Could be adapted on
> applying though.
>
> In my (short) tests the usb passthrough part only adds a single second,
> but i can imagine different devices on other systems could block it for
> much longer.
>
> src/PVE/CLI/qm.pm | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
> index bdae9641..16875ed2 100755
> --- a/src/PVE/CLI/qm.pm
> +++ b/src/PVE/CLI/qm.pm
> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
> 60,
> sub {
> my $conf = PVE::QemuConfig->load_config($vmid);
> +
> + # wait for some timeout until vm process exits, since this might not be instant
> + my $timeout = 30;
> + my $starttime = time();
> my $pid = PVE::QemuServer::check_running($vmid);
> - die "vm still running\n" if $pid;
> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
> +
> + while ($pid && (time() - $starttime) < $timeout) {
> + sleep(1);
> + $pid = PVE::QemuServer::check_running($vmid);
nit: this helper is deprecated - and this call here only runs in the
context of "guest is local"; we already obtained the lock and loaded
the config, so we know that invariant holds, and this new code (and the
old line above) can just use
PVE::QemuServer::Helpers::vm_running_locally instead.
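e.g. the same loop as in the patch, just with the helper swapped (untested
sketch):

    my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
    warn "vm still running - waiting up to $timeout seconds\n" if $pid;

    while ($pid && (time() - $starttime) < $timeout) {
        sleep(1);
        $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
    }

    die "vm still running - aborting cleanup\n" if $pid;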
> + }
> +
> + die "vm still running - aborting cleanup\n" if $pid;
>
> # Rollback already does cleanup when preparing and afterwards temporarily drops the
> # lock on the configuration file to rollback the volumes. Deactivating volumes here
> --
> 2.47.3
>
>
>
>
>
>
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-10 11:15 [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds Dominik Csapak
2026-02-12 20:33 ` Benjamin McGuire
2026-02-13 11:40 ` Fabian Grünbichler
@ 2026-02-13 12:14 ` Fiona Ebner
2026-02-13 12:20 ` Fabian Grünbichler
2026-02-13 12:22 ` Dominik Csapak
2 siblings, 2 replies; 14+ messages in thread
From: Fiona Ebner @ 2026-02-13 12:14 UTC (permalink / raw)
To: Dominik Csapak, pve-devel
On 10.02.26 at 12:14 PM, Dominik Csapak wrote:
> When qmeventd detects a vm exiting, it starts 'qm cleanup' to cleanup
> files, executing hookscripts, etc.
>
> Since the vm process exits is sometimes not instant, wait up to 30
> seconds here to start the cleanup process instead of immediately
> aborting if the pid still exits. This prevented executing the hookscript
> on the 'post-stop' phase.
>
> This can be easily reproduced by e.g. passing through a usb device,
> which delays the qemu process exit for a few seconds.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> changes from v1:
> * use correct while condition (time() is always >= $starttime)
>
> original comment:
>
> The 30 second timeout was arbitrarily chosen, but we could probably
> start with something smaller, like 10 seconds? Could be adapted on
> applying though.
>
> In my (short) tests the usb passthrough part only adds a single second,
> but i can imagine different devices on other systems could block it for
> much longer.
>
> src/PVE/CLI/qm.pm | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
> index bdae9641..16875ed2 100755
> --- a/src/PVE/CLI/qm.pm
> +++ b/src/PVE/CLI/qm.pm
> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
> 60,
> sub {
> my $conf = PVE::QemuConfig->load_config($vmid);
> +
> + # wait for some timeout until vm process exits, since this might not be instant
s/timeout/time/
Nit: s/vm/the QEMU/
Maybe add "after the QMP 'SHUTDOWN' event"?
> + my $timeout = 30;
> + my $starttime = time();
> my $pid = PVE::QemuServer::check_running($vmid);
> - die "vm still running\n" if $pid;
> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
While we're at it, we could improve the message here. Something like
'QEMU process $pid for VM $vmid still running (or newly started)'
Having the PID is nice info for developers/support engineers and the
case where a new instance is started before the cleanup was done is also
possible.
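i.e. roughly something like (just a sketch, exact wording up for discussion):

    warn "QEMU process $pid for VM $vmid still running (or newly started)"
        . " - waiting up to $timeout seconds\n"
        if $pid;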
In fact, the case with the new instance is easily triggered by 'stop'
mode backups. Maybe we should fix that up first before adding a timeout
here?
Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
> +
> + while ($pid && (time() - $starttime) < $timeout) {
> + sleep(1);
> + $pid = PVE::QemuServer::check_running($vmid);
> + }
> +
> + die "vm still running - aborting cleanup\n" if $pid;
>
> # Rollback already does cleanup when preparing and afterwards temporarily drops the
> # lock on the configuration file to rollback the volumes. Deactivating volumes here
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-13 12:14 ` Fiona Ebner
@ 2026-02-13 12:20 ` Fabian Grünbichler
2026-02-13 13:16 ` Fiona Ebner
2026-02-13 12:22 ` Dominik Csapak
1 sibling, 1 reply; 14+ messages in thread
From: Fabian Grünbichler @ 2026-02-13 12:20 UTC (permalink / raw)
To: Dominik Csapak, Fiona Ebner, pve-devel
On February 13, 2026 1:14 pm, Fiona Ebner wrote:
> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>> When qmeventd detects a vm exiting, it starts 'qm cleanup' to cleanup
>> files, executing hookscripts, etc.
>>
>> Since the vm process exits is sometimes not instant, wait up to 30
>> seconds here to start the cleanup process instead of immediately
>> aborting if the pid still exits. This prevented executing the hookscript
>> on the 'post-stop' phase.
>>
>> This can be easily reproduced by e.g. passing through a usb device,
>> which delays the qemu process exit for a few seconds.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> changes from v1:
>> * use correct while condition (time() is always >= $starttime)
>>
>> original comment:
>>
>> The 30 second timeout was arbitrarily chosen, but we could probably
>> start with something smaller, like 10 seconds? Could be adapted on
>> applying though.
>>
>> In my (short) tests the usb passthrough part only adds a single second,
>> but i can imagine different devices on other systems could block it for
>> much longer.
>>
>> src/PVE/CLI/qm.pm | 13 ++++++++++++-
>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
>> index bdae9641..16875ed2 100755
>> --- a/src/PVE/CLI/qm.pm
>> +++ b/src/PVE/CLI/qm.pm
>> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
>> 60,
>> sub {
>> my $conf = PVE::QemuConfig->load_config($vmid);
>> +
>> + # wait for some timeout until vm process exits, since this might not be instant
>
> s/timeout/time/
>
> Nit: s/vm/the QEMU/
>
> Maybe add "after the QMP 'SHUTDOWN' event"?
>
>> + my $timeout = 30;
>> + my $starttime = time();
>> my $pid = PVE::QemuServer::check_running($vmid);
>> - die "vm still running\n" if $pid;
>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>
> While we're at it, we could improve the message here. Something like
> 'QEMU process $pid for VM $vmid still running (or newly started)'
> Having the PID is nice info for developers/support engineers and the
> case where a new instance is started before the cleanup was done is also
> possible.
>
> In fact, the case with the new instance is easily triggered by 'stop'
> mode backups. Maybe we should fix that up first before adding a timeout
> here?
>
> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
does this mean we should actually have some sort of mechanism similar to
the reboot flag to indicate a pending cleanup, and block/delay starts if
it is still set?
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-13 12:14 ` Fiona Ebner
2026-02-13 12:20 ` Fabian Grünbichler
@ 2026-02-13 12:22 ` Dominik Csapak
1 sibling, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2026-02-13 12:22 UTC (permalink / raw)
To: Fiona Ebner, pve-devel
On 2/13/26 1:14 PM, Fiona Ebner wrote:
> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>> When qmeventd detects a vm exiting, it starts 'qm cleanup' to cleanup
>> files, executing hookscripts, etc.
>>
>> Since the vm process exits is sometimes not instant, wait up to 30
>> seconds here to start the cleanup process instead of immediately
>> aborting if the pid still exits. This prevented executing the hookscript
>> on the 'post-stop' phase.
>>
>> This can be easily reproduced by e.g. passing through a usb device,
>> which delays the qemu process exit for a few seconds.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> changes from v1:
>> * use correct while condition (time() is always >= $starttime)
>>
>> original comment:
>>
>> The 30 second timeout was arbitrarily chosen, but we could probably
>> start with something smaller, like 10 seconds? Could be adapted on
>> applying though.
>>
>> In my (short) tests the usb passthrough part only adds a single second,
>> but i can imagine different devices on other systems could block it for
>> much longer.
>>
>> src/PVE/CLI/qm.pm | 13 ++++++++++++-
>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
>> index bdae9641..16875ed2 100755
>> --- a/src/PVE/CLI/qm.pm
>> +++ b/src/PVE/CLI/qm.pm
>> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
>> 60,
>> sub {
>> my $conf = PVE::QemuConfig->load_config($vmid);
>> +
>> + # wait for some timeout until vm process exits, since this might not be instant
>
> s/timeout/time/
>
> Nit: s/vm/the QEMU/
>
> Maybe add "after the QMP 'SHUTDOWN' event"?
>
>> + my $timeout = 30;
>> + my $starttime = time();
>> my $pid = PVE::QemuServer::check_running($vmid);
>> - die "vm still running\n" if $pid;
>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>
> While we're at it, we could improve the message here. Something like
> 'QEMU process $pid for VM $vmid still running (or newly started)'
> Having the PID is nice info for developers/support engineers and the
> case where a new instance is started before the cleanup was done is also
> possible.
>
> In fact, the case with the new instance is easily triggered by 'stop'
> mode backups. Maybe we should fix that up first before adding a timeout
> here?
>
> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>
Sounds good, one possibility would be to do no cleanup at all when doing
a stop mode backup?
We already know we'll need the resources (pid/socket/etc. files,
vgpus,...) again?
Or is there some situation where that might not be the case?
>
>> +
>> + while ($pid && (time() - $starttime) < $timeout) {
>> + sleep(1);
>> + $pid = PVE::QemuServer::check_running($vmid);
>> + }
>> +
>> + die "vm still running - aborting cleanup\n" if $pid;
>>
>> # Rollback already does cleanup when preparing and afterwards temporarily drops the
>> # lock on the configuration file to rollback the volumes. Deactivating volumes here
>
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-13 12:20 ` Fabian Grünbichler
@ 2026-02-13 13:16 ` Fiona Ebner
2026-02-16 8:42 ` Fabian Grünbichler
0 siblings, 1 reply; 14+ messages in thread
From: Fiona Ebner @ 2026-02-13 13:16 UTC (permalink / raw)
To: Fabian Grünbichler, Dominik Csapak, pve-devel
On 13.02.26 at 1:20 PM, Fabian Grünbichler wrote:
> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>>> + my $timeout = 30;
>>> + my $starttime = time();
>>> my $pid = PVE::QemuServer::check_running($vmid);
>>> - die "vm still running\n" if $pid;
>>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>
>> While we're at it, we could improve the message here. Something like
>> 'QEMU process $pid for VM $vmid still running (or newly started)'
>> Having the PID is nice info for developers/support engineers and the
>> case where a new instance is started before the cleanup was done is also
>> possible.
>>
>> In fact, the case with the new instance is easily triggered by 'stop'
>> mode backups. Maybe we should fix that up first before adding a timeout
>> here?
>>
>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>
> does this mean we should actually have some sort of mechanism similar to
> the reboot flag to indicate a pending cleanup, and block/delay starts if
> it is still set?
Blocking/delaying starts is not what happens for the reboot flag/file:
> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
> [...]
> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
Currently, it's just indicating whether the cleanup handler should start
the VM again afterwards.
On 13.02.26 at 1:22 PM, Dominik Csapak wrote:
> Sounds good, one possibility would be to do no cleanup at all when doing
> a stop mode backup?
> We already know we'll need the resources (pid/socket/etc. files, vgpus,...) again?
>
> Or is there some situation where that might not be the case?
We do it for reboot (unless another start task sneaks in like in my
example above), and I don't see a good reason off the top of my head
why 'stop' mode backup should behave differently from a reboot (for
running VMs). It even applies pending changes just like a reboot right now.
I'm not sure if there is an actual need to do cleanup or if we could
also skip it when we are planning to spin up another instance right
away. But we do it for reboot, so the "safe" variant is also doing it
for 'stop' mode backup. History tells me it's been there since the
reboot functionality was added:
https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-13 13:16 ` Fiona Ebner
@ 2026-02-16 8:42 ` Fabian Grünbichler
2026-02-16 9:15 ` Fiona Ebner
0 siblings, 1 reply; 14+ messages in thread
From: Fabian Grünbichler @ 2026-02-16 8:42 UTC (permalink / raw)
To: Dominik Csapak, Fiona Ebner, pve-devel
On February 13, 2026 2:16 pm, Fiona Ebner wrote:
> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>>>> + my $timeout = 30;
>>>> + my $starttime = time();
>>>> my $pid = PVE::QemuServer::check_running($vmid);
>>>> - die "vm still running\n" if $pid;
>>>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>>
>>> While we're at it, we could improve the message here. Something like
>>> 'QEMU process $pid for VM $vmid still running (or newly started)'
>>> Having the PID is nice info for developers/support engineers and the
>>> case where a new instance is started before the cleanup was done is also
>>> possible.
>>>
>>> In fact, the case with the new instance is easily triggered by 'stop'
>>> mode backups. Maybe we should fix that up first before adding a timeout
>>> here?
>>>
>>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
>>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>
>> does this mean we should actually have some sort of mechanism similar to
>> the reboot flag to indicate a pending cleanup, and block/delay starts if
>> it is still set?
>
> Blocking/delaying starts is not what happens for the reboot flag/file:
that's not what I meant, the similarity was just "have a flag", not
"have a flag that behaves identical" ;)
my proposal was:
- add a flag that indicates cleanup is pending (similar to reboot is
pending)
- *handle that flag* in the start flow to wait for the cleanup to be
done before starting
>> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>> [...]
>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>
> Currently, it's just indicating whether the cleanup handler should start
> the VM again afterwards.
>
> Am 13.02.26 um 1:22 PM schrieb Dominik Csapak:
>> Sounds good, one possibility would be to do no cleanup at all when doing
>> a stop mode backup?
>> We already know we'll need the resources (pid/socket/etc. files, vgpus,...) again?
>>
>> Or is there some situation where that might not be the case?
>
> We do it for reboot (if not another start task sneaks in like in my
> example above), and I don't see a good reason from the top of my head
> why 'stop' mode backup should behave differently from a reboot (for
> running VMs). It even applies pending changes just like a reboot right now.
but what about external callers doing something like:
- stop
- do whatever
- start
in rapid (automated) succession? those would still (possibly) trigger
cleanup after "doing whatever" and starting the VM again already? and in
particular if we skip cleanup for "our" cases of stop;start it will be
easy to introduce sideeffects in cleanup that break such usage?
> I'm not sure if there is an actual need to do cleanup or if we could
> also skip it when we are planning to spin up another instance right
> away. But we do it for reboot, so the "safe" variant is also doing it
> for 'stop' mode backup. History tells me it's been there since the
> reboot functionality was added:
> https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
>
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-16 8:42 ` Fabian Grünbichler
@ 2026-02-16 9:15 ` Fiona Ebner
2026-02-19 10:15 ` Dominik Csapak
0 siblings, 1 reply; 14+ messages in thread
From: Fiona Ebner @ 2026-02-16 9:15 UTC (permalink / raw)
To: Fabian Grünbichler, Dominik Csapak, pve-devel
On 16.02.26 at 9:42 AM, Fabian Grünbichler wrote:
> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>>>>> + my $timeout = 30;
>>>>> + my $starttime = time();
>>>>> my $pid = PVE::QemuServer::check_running($vmid);
>>>>> - die "vm still running\n" if $pid;
>>>>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>>>
>>>> While we're at it, we could improve the message here. Something like
>>>> 'QEMU process $pid for VM $vmid still running (or newly started)'
>>>> Having the PID is nice info for developers/support engineers and the
>>>> case where a new instance is started before the cleanup was done is also
>>>> possible.
>>>>
>>>> In fact, the case with the new instance is easily triggered by 'stop'
>>>> mode backups. Maybe we should fix that up first before adding a timeout
>>>> here?
>>>>
>>>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
>>>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>>
>>> does this mean we should actually have some sort of mechanism similar to
>>> the reboot flag to indicate a pending cleanup, and block/delay starts if
>>> it is still set?
>>
>> Blocking/delaying starts is not what happens for the reboot flag/file:
>
> that's not what I meant, the similarity was just "have a flag", not
> "have a flag that behaves identical" ;)
>
> my proposal was:
> - add a flag that indicates cleanup is pending (similar to reboot is
> pending)
> - *handle that flag* in the start flow to wait for the cleanup to be
> done before starting
Shouldn't we change the reboot flag to also do this?
>>> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>>> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>> [...]
>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>>> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>>
>> Currently, it's just indicating whether the cleanup handler should start
>> the VM again afterwards.
>>
>> Am 13.02.26 um 1:22 PM schrieb Dominik Csapak:
>>> Sounds good, one possibility would be to do no cleanup at all when doing
>>> a stop mode backup?
>>> We already know we'll need the resources (pid/socket/etc. files, vgpus,...) again?
>>>
>>> Or is there some situation where that might not be the case?
>>
>> We do it for reboot (if not another start task sneaks in like in my
>> example above), and I don't see a good reason from the top of my head
>> why 'stop' mode backup should behave differently from a reboot (for
>> running VMs). It even applies pending changes just like a reboot right now.
>
> but what about external callers doing something like:
>
> - stop
> - do whatever
> - start
>
> in rapid (automated) succession? those would still (possibly) trigger
> cleanup after "doing whatever" and starting the VM again already? and in
> particular if we skip cleanup for "our" cases of stop;start it will be
> easy to introduce sideeffects in cleanup that break such usage?
I did not argue for skipping cleanup. I argued for being consistent with
reboot where we (try to) do cleanup. I just wasn't sure it's really needed.
>> I'm not sure if there is an actual need to do cleanup or if we could
I guess the actual need is to have more consistent behavior.
>> also skip it when we are planning to spin up another instance right
>> away. But we do it for reboot, so the "safe" variant is also doing it
>> for 'stop' mode backup. History tells me it's been there since the
>> reboot functionality was added:
>> https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-16 9:15 ` Fiona Ebner
@ 2026-02-19 10:15 ` Dominik Csapak
2026-02-19 13:27 ` Fiona Ebner
0 siblings, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2026-02-19 10:15 UTC (permalink / raw)
To: Fiona Ebner, Fabian Grünbichler, pve-devel
On 2/16/26 10:15 AM, Fiona Ebner wrote:
> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
>>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>>>>>> + my $timeout = 30;
>>>>>> + my $starttime = time();
>>>>>> my $pid = PVE::QemuServer::check_running($vmid);
>>>>>> - die "vm still running\n" if $pid;
>>>>>> + warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>>>>
>>>>> While we're at it, we could improve the message here. Something like
>>>>> 'QEMU process $pid for VM $vmid still running (or newly started)'
>>>>> Having the PID is nice info for developers/support engineers and the
>>>>> case where a new instance is started before the cleanup was done is also
>>>>> possible.
>>>>>
>>>>> In fact, the case with the new instance is easily triggered by 'stop'
>>>>> mode backups. Maybe we should fix that up first before adding a timeout
>>>>> here?
>>>>>
>>>>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
>>>>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>>>
>>>> does this mean we should actually have some sort of mechanism similar to
>>>> the reboot flag to indicate a pending cleanup, and block/delay starts if
>>>> it is still set?
>>>
>>> Blocking/delaying starts is not what happens for the reboot flag/file:
>>
>> that's not what I meant, the similarity was just "have a flag", not
>> "have a flag that behaves identical" ;)
>>
>> my proposal was:
>> - add a flag that indicates cleanup is pending (similar to reboot is
>> pending)
>> - *handle that flag* in the start flow to wait for the cleanup to be
>> done before starting
>
> Shouldn't we change the reboot flag to also do this?
>
>>>> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>>>> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>>> [...]
>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>>>> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>>>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>>>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>>>
>>> Currently, it's just indicating whether the cleanup handler should start
>>> the VM again afterwards.
>>>
>>> Am 13.02.26 um 1:22 PM schrieb Dominik Csapak:
>>>> Sounds good, one possibility would be to do no cleanup at all when doing
>>>> a stop mode backup?
>>>> We already know we'll need the resources (pid/socket/etc. files, vgpus,...) again?
>>>>
>>>> Or is there some situation where that might not be the case?
>>>
>>> We do it for reboot (if not another start task sneaks in like in my
>>> example above), and I don't see a good reason from the top of my head
>>> why 'stop' mode backup should behave differently from a reboot (for
>>> running VMs). It even applies pending changes just like a reboot right now.
>>
>> but what about external callers doing something like:
>>
>> - stop
>> - do whatever
>> - start
>>
>> in rapid (automated) succession? those would still (possibly) trigger
>> cleanup after "doing whatever" and starting the VM again already? and in
>> particular if we skip cleanup for "our" cases of stop;start it will be
>> easy to introduce sideeffects in cleanup that break such usage?
>
> I did not argue for skipping cleanup. I argued for being consistent with
> reboot where we (try to) do cleanup. I just wasn't sure it's really needed.
>
>>> I'm not sure if there is an actual need to do cleanup or if we could
>
> I guess the actual need is to have more consistent behavior.
>
ok so I think we'd need to
* create a cleanup flag for each VM when qmeventd detects a VM shutting
down (in /var/run/qemu-server/VMID.cleanup, possibly with a timestamp)
* remove that cleanup flag after cleanup (obviously)
* on start, check for that flag and block for some timeout before
starting (e.g. check the timestamp in the flag; if it's older than some
limit, start regardless?)
?
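Rough sketch of what I mean (untested, $vmid/$timeout assumed to be in scope):

    # path of the proposed per-VM cleanup flag
    my $cleanup_flag = "/var/run/qemu-server/$vmid.cleanup";

    # when the VM shutdown is detected: record that a cleanup is pending
    PVE::Tools::file_set_contents($cleanup_flag, time());

    # ... cleanup runs ...

    # and remove the flag again once the cleanup is done
    unlink($cleanup_flag);

    # in vm_start: block until the flag is gone or some timeout is hit
    # (could also read the timestamp inside and ignore stale flags)
    my $start = time();
    while (-e $cleanup_flag && (time() - $start) < $timeout) {
        sleep(1);
    }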
>>> also skip it when we are planning to spin up another instance right
>>> away. But we do it for reboot, so the "safe" variant is also doing it
>>> for 'stop' mode backup. History tells me it's been there since the
>>> reboot functionality was added:
>>> https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
>
>
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-19 10:15 ` Dominik Csapak
@ 2026-02-19 13:27 ` Fiona Ebner
2026-02-20 9:36 ` Dominik Csapak
0 siblings, 1 reply; 14+ messages in thread
From: Fiona Ebner @ 2026-02-19 13:27 UTC (permalink / raw)
To: Dominik Csapak, Fabian Grünbichler, pve-devel
On 19.02.26 at 11:15 AM, Dominik Csapak wrote:
> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
>>>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>>>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
>>>>>>> + my $timeout = 30;
>>>>>>> + my $starttime = time();
>>>>>>> my $pid = PVE::QemuServer::check_running($vmid);
>>>>>>> - die "vm still running\n" if $pid;
>>>>>>> + warn "vm still running - waiting up to $timeout
>>>>>>> seconds\n" if $pid;
>>>>>>
>>>>>> While we're at it, we could improve the message here. Something like
>>>>>> 'QEMU process $pid for VM $vmid still running (or newly started)'
>>>>>> Having the PID is nice info for developers/support engineers and the
>>>>>> case where a new instance is started before the cleanup was done
>>>>>> is also
>>>>>> possible.
>>>>>>
>>>>>> In fact, the case with the new instance is easily triggered by 'stop'
>>>>>> mode backups. Maybe we should fix that up first before adding a
>>>>>> timeout
>>>>>> here?
>>>>>>
>>>>>> Feb 13 13:09:48 pve9a1 qm[92975]: <root@pam> end task
>>>>>> UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>>>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>>>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>>>>
>>>>> does this mean we should actually have some sort of mechanism
>>>>> similar to
>>>>> the reboot flag to indicate a pending cleanup, and block/delay
>>>>> starts if
>>>>> it is still set?
>>>>
>>>> Blocking/delaying starts is not what happens for the reboot flag/file:
>>>
>>> that's not what I meant, the similarity was just "have a flag", not
>>> "have a flag that behaves identical" ;)
>>>
>>> my proposal was:
>>> - add a flag that indicates cleanup is pending (similar to reboot is
>>> pending)
>>> - *handle that flag* in the start flow to wait for the cleanup to be
>>> done before starting
>>
>> Shouldn't we change the reboot flag to also do this?
>>
>>>>> Feb 13 14:00:16 pve9a1 qm[124470]: <root@pam> starting task
>>>>> UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>>>>> Feb 13 14:00:16 pve9a1 qm[124472]: <root@pam> starting task
>>>>> UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>>>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102:
>>>>> UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>>>> [...]
>>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated
>>>>> successfully.
>>>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s
>>>>> CPU time, 2G memory peak.
>>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>>>>> Feb 13 14:00:23 pve9a1 qm[124470]: <root@pam> end task
>>>>> UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>>>>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>>>>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>>>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>>>>
>>>> Currently, it's just indicating whether the cleanup handler should
>>>> start
>>>> the VM again afterwards.
>>>>
>>>> Am 13.02.26 um 1:22 PM schrieb Dominik Csapak:
>>>>> Sounds good, one possibility would be to do no cleanup at all when
>>>>> doing
>>>>> a stop mode backup?
>>>>> We already know we'll need the resources (pid/socket/etc. files,
>>>>> vgpus,...) again?
>>>>>
>>>>> Or is there some situation where that might not be the case?
>>>>
>>>> We do it for reboot (if not another start task sneaks in like in my
>>>> example above), and I don't see a good reason from the top of my head
>>>> why 'stop' mode backup should behave differently from a reboot (for
>>>> running VMs). It even applies pending changes just like a reboot
>>>> right now.
>>>
>>> but what about external callers doing something like:
>>>
>>> - stop
>>> - do whatever
>>> - start
>>>
>>> in rapid (automated) succession? those would still (possibly) trigger
>>> cleanup after "doing whatever" and starting the VM again already? and in
>>> particular if we skip cleanup for "our" cases of stop;start it will be
>>> easy to introduce sideeffects in cleanup that break such usage?
>>
>> I did not argue for skipping cleanup. I argued for being consistent with
>> reboot where we (try to) do cleanup. I just wasn't sure it's really
>> needed.
>>
>>>> I'm not sure if there is an actual need to do cleanup or if we could
>>
>> I guess the actual need is to have more consistent behavior.
>>
>
> ok so i think we'd need to
> * create a cleanup flag for each vm when qmevent detects a vm shutting
> down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
> * removing that cleanup flag after cleanup (obviously)
> * on start, check for that flag and block for some timeout before
> starting (e.g. check the timestamp in the flag if it's longer than some
> time, start it regardless?)
Sounds good to me.
Unfortunately, something else: turns out that we kinda rely on qmeventd
not doing the cleanup for the optimization with keeping the volumes
active (i.e. $keepActive). And actually, the optimization applies
randomly depending on who wins the race.
Output below with added log line
"doing cleanup for $vmid with keepActive=$keepActive"
in vm_stop_cleanup() to be able to see what happens.
We try to use the optimization but qmeventd interferes:
> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job: vzdump 102 --storage pbs --mode stop
> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM 102 (qemu)
> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102: UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU time, 1.9G memory peak.
> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with keepActive=1
> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with keepActive=0
> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
We manage to get the optimization:
> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102: UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU time, 2G memory peak.
> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with keepActive=1
> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
For regular shutdown, we'll also do the cleanup twice.
Maybe we also need a way to tell qmeventd that we already did the cleanup?
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-19 13:27 ` Fiona Ebner
@ 2026-02-20 9:36 ` Dominik Csapak
2026-02-20 14:30 ` Fiona Ebner
0 siblings, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2026-02-20 9:36 UTC (permalink / raw)
To: Fiona Ebner, Fabian Grünbichler, pve-devel
On 2/19/26 2:27 PM, Fiona Ebner wrote:
> Am 19.02.26 um 11:15 AM schrieb Dominik Csapak:
>> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>>> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>>> Am 13.02.26 um 1:20 PM schrieb Fabian Grünbichler:
>>>>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>>>>>> Am 10.02.26 um 12:14 PM schrieb Dominik Csapak:
[snip]
>>>
>>> I guess the actual need is to have more consistent behavior.
>>>
>>
>> ok so i think we'd need to
>> * create a cleanup flag for each vm when qmevent detects a vm shutting
>> down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
>> * removing that cleanup flag after cleanup (obviously)
>> * on start, check for that flag and block for some timeout before
>> starting (e.g. check the timestamp in the flag if it's longer than some
>> time, start it regardless?)
>
> Sounds good to me.
>
> Unfortunately, something else: turns out that we kinda rely on qmeventd
> not doing the cleanup for the optimization with keeping the volumes
> active (i.e. $keepActive). And actually, the optimization applies
> randomly depending on who wins the race.
>
> Output below with added log line
> "doing cleanup for $vmid with keepActive=$keepActive"
> in vm_stop_cleanup() to be able to see what happens.
>
> We try to use the optimization but qmeventd interferes:
>
>> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job: vzdump 102 --storage pbs --mode stop
>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM 102 (qemu)
>> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102: UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
>> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
>> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU time, 1.9G memory peak.
>> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with keepActive=1
>> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
>> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with keepActive=0
>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
>> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
>> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
>
> We manage to get the optimization:
>
>> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102: UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
>> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102 qga command 'guest-ping' failed - got timeout
>> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU time, 2G memory peak.
>> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with keepActive=1
>> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
>> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
>> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
>
> For regular shutdown, we'll also do the cleanup twice.
>
> Maybe we also need a way to tell qmeventd that we already did the cleanup?
ok well then I'd try to do something like this:
in 'vm_stop' we'll create a cleanup flag with timestamp + state (e.g. 'queued')
in vm_stop_cleanup we change/create the flag to 'started' and clear the
flag after cleanup
(if it's already there in 'started' state within a time limit, ignore it)
in vm_start we block until the cleanup flag is gone or until some timeout
in 'qm cleanup' we only start the cleanup if the flag does not exist
I think this should make the behavior consistent?
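Very rough sketch of the flow (untested, flag file format made up,
$vmid/$timeout assumed in scope):

    my $flag = "/var/run/qemu-server/$vmid.cleanup";

    # vm_stop: queue the cleanup
    PVE::Tools::file_set_contents($flag, "queued " . time());

    # vm_stop_cleanup: claim the cleanup, do it, then drop the flag
    # (if it's already in 'started' state and recent enough, skip instead)
    PVE::Tools::file_set_contents($flag, "started " . time());
    # ... actual cleanup ...
    unlink($flag);

    # vm_start: block (as sketched earlier) until the flag is gone or a
    # timeout is hit

    # 'qm cleanup' (triggered via qmeventd): skip if a stop path already
    # queued/claimed the cleanup
    return if -e $flag;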
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-20 9:36 ` Dominik Csapak
@ 2026-02-20 14:30 ` Fiona Ebner
2026-02-20 14:51 ` Dominik Csapak
0 siblings, 1 reply; 14+ messages in thread
From: Fiona Ebner @ 2026-02-20 14:30 UTC (permalink / raw)
To: Dominik Csapak, Fabian Grünbichler, pve-devel
On 20.02.26 at 10:36 AM, Dominik Csapak wrote:
> On 2/19/26 2:27 PM, Fiona Ebner wrote:
>> Am 19.02.26 um 11:15 AM schrieb Dominik Csapak:
>>> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>>>> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>>>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>>
>>>> I guess the actual need is to have more consistent behavior.
>>>>
>>>
>>> ok so i think we'd need to
>>> * create a cleanup flag for each vm when qmevent detects a vm shutting
>>> down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
>>> * removing that cleanup flag after cleanup (obviously)
>>> * on start, check for that flag and block for some timeout before
>>> starting (e.g. check the timestamp in the flag if it's longer than some
>>> time, start it regardless?)
>>
>> Sounds good to me.
>>
>> Unfortunately, something else: turns out that we kinda rely on qmeventd
>> not doing the cleanup for the optimization with keeping the volumes
>> active (i.e. $keepActive). And actually, the optimization applies
>> randomly depending on who wins the race.
>>
>> Output below with added log line
>> "doing cleanup for $vmid with keepActive=$keepActive"
>> in vm_stop_cleanup() to be able to see what happens.
>>
>> We try to use the optimization but qmeventd interferes:
>>
>>> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task
>>> UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job:
>>> vzdump 102 --storage pbs --mode stop
>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM
>>> 102 (qemu)
>>> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102:
>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task
>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102
>>> qga command 'guest-ping' failed - got timeout
>>> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task
>>> UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU
>>> time, 1.9G memory peak.
>>> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with
>>> keepActive=1
>>> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task
>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
>>> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with
>>> keepActive=0
>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
>>> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
>>
>> We manage to get the optimization:
>>
>>> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102:
>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
>>> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102
>>> qga command 'guest-ping' failed - got timeout
>>> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU
>>> time, 2G memory peak.
>>> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with
>>> keepActive=1
>>> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task
>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
>>> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
>>> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
>>
>> For regular shutdown, we'll also do the cleanup twice.
>>
>> Maybe we also need a way to tell qmeventd that we already did the
>> cleanup?
>
>
> ok well then i'd try to do something like this:
>
> in
>
> 'vm_stop' we'll create a cleanup flag with timestamp + state (e.g.
> 'queued')
>
> in vm_stop_cleanup we change/create the flag with
> 'started' and clear the flag after cleanup
Why is the one in vm_stop needed? Is there any advantage over creating
it directly in vm_stop_cleanup()?
> (if it's here already in 'started' state within a timelimit, ignore it)
>
> in vm_start we block until the cleanup flag is gone or until some timeout
>
> in 'qm cleanup' we only start it if the flag does not exist
Hmm, it does also call vm_stop_cleanup() so we could just re-use the
check there for that part? I guess doing an early check doesn't hurt
either, as long as we do call the post-stop hook.
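Just to illustrate what I mean, a rough sketch (not actual code; the
flag path is the one proposed above, and the sub/variable names are
made up):

    # sketch: early check at the start of the 'qm cleanup' code path
    sub cleanup_after_vm_exit {
        my ($vmid) = @_;

        # if the flag exists, the vm_stop code path owns the cleanup
        my $flag = "/var/run/qemu-server/${vmid}.cleanup";
        my $handled_by_vm_stop = -e $flag;

        if (!$handled_by_vm_stop) {
            # ... usual volume/network cleanup would go here ...
        }

        # the 'post-stop' hookscript should still run either way
        # ... execute the hookscript here ...
    }

The same existence check could of course live inside vm_stop_cleanup()
instead.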
> I think this should make the behavior consistent?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
2026-02-20 14:30 ` Fiona Ebner
@ 2026-02-20 14:51 ` Dominik Csapak
0 siblings, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2026-02-20 14:51 UTC (permalink / raw)
To: Fiona Ebner, Fabian Grünbichler, pve-devel
On 2/20/26 3:30 PM, Fiona Ebner wrote:
> Am 20.02.26 um 10:36 AM schrieb Dominik Csapak:
>> On 2/19/26 2:27 PM, Fiona Ebner wrote:
>>> Am 19.02.26 um 11:15 AM schrieb Dominik Csapak:
>>>> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>>>>> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>>>>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>>>
>>>>> I guess the actual need is to have more consistent behavior.
>>>>>
>>>>
>>>> ok so i think we'd need to
>>>> * create a cleanup flag for each vm when qmevent detects a vm shutting
>>>> down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
>>>> * removing that cleanup flag after cleanup (obviously)
>>>> * on start, check for that flag and block for some timeout before
>>>> starting (e.g. check the timestamp in the flag if it's longer than some
>>>> time, start it regardless?)
>>>
>>> Sounds good to me.
>>>
>>> Unfortunately, something else: turns out that we kinda rely on qmeventd
>>> not doing the cleanup for the optimization with keeping the volumes
>>> active (i.e. $keepActive). And actually, the optimization applies
>>> randomly depending on who wins the race.
>>>
>>> Output below with added log line
>>> "doing cleanup for $vmid with keepActive=$keepActive"
>>> in vm_stop_cleanup() to be able to see what happens.
>>>
>>> We try to use the optimization but qmeventd interferes:
>>>
>>>> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task
>>>> UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
>>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job:
>>>> vzdump 102 --storage pbs --mode stop
>>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM
>>>> 102 (qemu)
>>>> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102:
>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>>> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task
>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>>> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102
>>>> qga command 'guest-ping' failed - got timeout
>>>> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>>> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task
>>>> UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
>>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU
>>>> time, 1.9G memory peak.
>>>> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with
>>>> keepActive=1
>>>> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task
>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
>>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
>>>> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with
>>>> keepActive=0
>>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
>>>> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
>>>> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
>>>
>>> We manage to get the optimization:
>>>
>>>> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102:
>>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
>>>> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102
>>>> qga command 'guest-ping' failed - got timeout
>>>> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU
>>>> time, 2G memory peak.
>>>> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with
>>>> keepActive=1
>>>> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task
>>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
>>>> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
>>>> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
>>>
>>> For regular shutdown, we'll also do the cleanup twice.
>>>
>>> Maybe we also need a way to tell qmeventd that we already did the
>>> cleanup?
>>
>>
>> ok well then i'd try to do something like this:
>>
>> in
>>
>> 'vm_stop' we'll create a cleanup flag with timestamp + state (e.g.
>> 'queued')
>>
>> in vm_stop_cleanup we change/create the flag with
>> 'started' and clear the flag after cleanup
>
> Why is the one in vm_stop needed? Is there any advantage over creating
> it directly in vm_stop_cleanup()?
>
After a bit of experimenting and re-reading the code, I think I can
simplify the logic:

* at the beginning of vm_stop, we create the cleanup flag
* in 'qm cleanup', we only do the cleanup if the flag does not exist
* in 'vm_start', we remove the flag

This should work because these parts are under a config lock anyway:
* from vm_stop to vm_stop_cleanup
* most of the 'qm cleanup' code
* vm_start

So we only really have to mark that the cleanup was done from
the vm_stop code path.
(We have to create the flag at the beginning of vm_stop, because then
there is no race between vm_stop calling its cleanup and qmeventd
picking up the vanishing process.)

Does that make sense to you?
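To make that a bit more concrete, a rough sketch of the flag handling
(the helper names are made up, locking and error handling are left out,
so this is not meant as the final code):

    use strict;
    use warnings;

    sub cleanup_flag_path {
        my ($vmid) = @_;
        return "/var/run/qemu-server/${vmid}.cleanup";
    }

    # at the beginning of vm_stop: mark that this code path does the cleanup
    sub set_cleanup_flag {
        my ($vmid) = @_;
        my $path = cleanup_flag_path($vmid);
        open(my $fh, '>', $path) or die "unable to create $path - $!\n";
        print $fh time() . "\n"; # timestamp, e.g. to ignore stale flags
        close($fh);
    }

    # in vm_start: remove the flag again
    sub clear_cleanup_flag {
        my ($vmid) = @_;
        unlink(cleanup_flag_path($vmid));
    }

    # in 'qm cleanup' (triggered by qmeventd): skip if the flag exists
    sub cleanup_handled_by_vm_stop {
        my ($vmid) = @_;
        return -e cleanup_flag_path($vmid);
    }

Since all three call sites are under the config lock anyway, checking
and creating/removing the flag like this should not race.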
>> (if it's here already in 'started' state within a timelimit, ignore it)
>>
>> in vm_start we block until the cleanup flag is gone or until some timeout
>>
>> in 'qm cleanup' we only start it if the flag does not exist
>
> Hmm, it does also call vm_stop_cleanup() so we could just re-use the
> check there for that part? I guess doing an early check doesn't hurt
> either, as long as we do call the post-stop hook.
>
>> I think this should make the behavior consistent?
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-02-20 14:51 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2026-02-10 11:15 [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds Dominik Csapak
2026-02-12 20:33 ` Benjamin McGuire
2026-02-13 11:40 ` Fabian Grünbichler
2026-02-13 12:14 ` Fiona Ebner
2026-02-13 12:20 ` Fabian Grünbichler
2026-02-13 13:16 ` Fiona Ebner
2026-02-16 8:42 ` Fabian Grünbichler
2026-02-16 9:15 ` Fiona Ebner
2026-02-19 10:15 ` Dominik Csapak
2026-02-19 13:27 ` Fiona Ebner
2026-02-20 9:36 ` Dominik Csapak
2026-02-20 14:30 ` Fiona Ebner
2026-02-20 14:51 ` Dominik Csapak
2026-02-13 12:22 ` Dominik Csapak