From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 16 Feb 2026 09:42:10 +0100
From: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Subject: Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
To: Dominik Csapak, Fiona Ebner, pve-devel@lists.proxmox.com
References: <20260210111612.2017883-1-d.csapak@proxmox.com> <7ee8d206-36fd-4ade-893b-c7c2222a8883@proxmox.com> <1770985110.nme4v4xomn.astroid@yuna.none> <9d501c98-a85c-44d4-af0e-0301b203d691@proxmox.com>
In-Reply-To: <9d501c98-a85c-44d4-af0e-0301b203d691@proxmox.com>
Message-Id: <1771231158.rte62d97r5.astroid@yuna.none>
List-Id: Proxmox VE development discussion

On February 13, 2026 2:16 pm, Fiona Ebner wrote:
> On 13.02.26 at 1:20 PM, Fabian Grünbichler wrote:
>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>> On 10.02.26 at 12:14 PM, Dominik Csapak wrote:
>>>> +    my $timeout = 30;
>>>> +    my $starttime = time();
>>>>      my $pid = PVE::QemuServer::check_running($vmid);
>>>> -    die "vm still running\n" if $pid;
>>>> +    warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>>
>>> While we're at it, we could improve the message here. Something like
>>> 'QEMU process $pid for VM $vmid still running (or newly started)'.
>>> Having the PID is nice info for developers/support engineers, and the
>>> case where a new instance is started before the cleanup was done is
>>> also possible.
>>>
>>> In fact, the case with the new instance is easily triggered by 'stop'
>>> mode backups. Maybe we should fix that up first before adding a timeout
>>> here?
>>>
>>> Feb 13 13:09:48 pve9a1 qm[92975]: end task UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>
>> does this mean we should actually have some sort of mechanism similar to
>> the reboot flag to indicate a pending cleanup, and block/delay starts if
>> it is still set?
>
> Blocking/delaying starts is not what happens for the reboot flag/file:

that's not what I meant, the similarity was just "have a flag", not
"have a flag that behaves identically" ;)

my proposal was:
- add a flag that indicates cleanup is pending (similar to the existing
  "reboot pending" flag)
- *handle that flag* in the start flow, i.e. wait for the cleanup to be
  done before starting

(rough sketch at the very end of this mail)

>> Feb 13 14:00:16 pve9a1 qm[124470]: starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>> Feb 13 14:00:16 pve9a1 qm[124472]: starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>> [...]
>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>> Feb 13 14:00:23 pve9a1 qm[124470]: end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>
> Currently, it's just indicating whether the cleanup handler should start
> the VM again afterwards.
>
> On 13.02.26 at 1:22 PM, Dominik Csapak wrote:
>> Sounds good, one possibility would be to do no cleanup at all when doing
>> a stop mode backup?
>> We already know we'll need the resources (pid/socket/etc. files, vgpus, ...) again?
>>
>> Or is there some situation where that might not be the case?
>
> We do it for reboot (if not another start task sneaks in like in my
> example above), and I don't see a good reason off the top of my head
> why 'stop' mode backup should behave differently from a reboot (for
> running VMs). It even applies pending changes just like a reboot right now.

but what about external callers doing something like:
- stop
- do whatever
- start

in rapid (automated) succession? those would still (possibly) trigger
cleanup after "doing whatever", when the VM has already been started
again? and in particular, if we skip cleanup for "our" cases of
stop;start, it will be easy to introduce side effects in cleanup that
break such usage?

> I'm not sure if there is an actual need to do cleanup or if we could
> also skip it when we are planning to spin up another instance right
> away. But we do it for reboot, so the "safe" variant is also doing it
> for 'stop' mode backup.
> History tells me it's been there since the reboot functionality was
> added:
> https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
>
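to make the flag proposal above a bit more concrete, a very rough sketch of
what I mean - all helper names and the flag path below are made up for
illustration and don't exist in qemu-server, the idea is just to mirror how
the reboot request file works:

    use strict;
    use warnings;

    use PVE::Tools;

    # hypothetical path of the per-VM "cleanup pending" flag
    sub cleanup_flag_path {
        my ($vmid) = @_;
        return "/run/qemu-server/${vmid}.cleanup";
    }

    # the qmeventd-triggered cleanup would set the flag as soon as it
    # starts, and clear it again once it is done
    sub create_cleanup_request {
        my ($vmid) = @_;
        PVE::Tools::file_set_contents(cleanup_flag_path($vmid), '');
    }

    sub clear_cleanup_request {
        my ($vmid) = @_;
        unlink cleanup_flag_path($vmid);
    }

    # and the start flow would wait (bounded) for a pending cleanup
    # instead of racing with it
    sub wait_for_pending_cleanup {
        my ($vmid, $timeout) = @_;
        $timeout //= 30;
        for (1 .. $timeout) {
            return if !-e cleanup_flag_path($vmid);
            sleep 1;
        }
        die "cleanup of VM $vmid still pending after ${timeout}s - refusing to start\n";
    }

that way a stop;start in rapid succession (stop mode backup, reboot,
external automation) would simply wait for the cleanup to finish, instead
of the cleanup running against the newly started instance.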