Subject: Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
Date: Mon, 16 Feb 2026 10:15:57 +0100
From: Fiona Ebner <f.ebner@proxmox.com>
To: Fabian Grünbichler, Dominik Csapak, pve-devel@lists.proxmox.com
Message-ID: <38236a30-a249-4ebe-bf89-788d67f36bd1@proxmox.com>
In-Reply-To: <1771231158.rte62d97r5.astroid@yuna.none>
References: <20260210111612.2017883-1-d.csapak@proxmox.com> <7ee8d206-36fd-4ade-893b-c7c2222a8883@proxmox.com> <1770985110.nme4v4xomn.astroid@yuna.none> <9d501c98-a85c-44d4-af0e-0301b203d691@proxmox.com> <1771231158.rte62d97r5.astroid@yuna.none>
List-Id: Proxmox VE development discussion

On 16.02.26 at 9:42 AM, Fabian Grünbichler wrote:
> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>> On 13.02.26 at 1:20 PM, Fabian Grünbichler wrote:
>>> On February 13, 2026 1:14 pm, Fiona Ebner wrote:
>>>> On 10.02.26 at 12:14 PM, Dominik Csapak wrote:
>>>>> +    my $timeout = 30;
>>>>> +    my $starttime = time();
>>>>>      my $pid = PVE::QemuServer::check_running($vmid);
>>>>> -    die "vm still running\n" if $pid;
>>>>> +    warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>>>>
>>>> While we're at it, we could improve the message here. Something like
>>>> 'QEMU process $pid for VM $vmid still running (or newly started)'.
>>>> Having the PID is nice info for developers/support engineers, and the
>>>> case where a new instance is started before the cleanup was done is
>>>> also possible.
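
To make this a bit more concrete, roughly what I have in mind (untested
sketch; the surrounding wait loop is only my assumption of how the rest of
the patch looks, $vmid comes from the existing cleanup context):

    # untested sketch - wait for the old QEMU process, but name PID and VM ID
    my $timeout = 30;
    my $starttime = time();
    my $pid = PVE::QemuServer::check_running($vmid);
    warn "QEMU process $pid for VM $vmid still running (or newly started)"
        . " - waiting up to $timeout seconds\n"
        if $pid;
    while ($pid && time() - $starttime < $timeout) {
        sleep(1);
        $pid = PVE::QemuServer::check_running($vmid);
    }
    die "QEMU process $pid for VM $vmid still running after $timeout seconds\n"
        if $pid;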
>>>>
>>>> In fact, the case with the new instance is easily triggered by 'stop'
>>>> mode backups. Maybe we should fix that up first before adding a
>>>> timeout here?
>>>>
>>>> Feb 13 13:09:48 pve9a1 qm[92975]: end task UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
>>>> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
>>>> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
>>>> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>>>
>>> does this mean we should actually have some sort of mechanism similar
>>> to the reboot flag to indicate a pending cleanup, and block/delay
>>> starts if it is still set?
>>
>> Blocking/delaying starts is not what happens for the reboot flag/file:
>
> that's not what I meant, the similarity was just "have a flag", not
> "have a flag that behaves identical" ;)
>
> my proposal was:
> - add a flag that indicates cleanup is pending (similar to reboot is
>   pending)
> - *handle that flag* in the start flow to wait for the cleanup to be
>   done before starting

Shouldn't we change the reboot flag to also do this?

>>> Feb 13 14:00:16 pve9a1 qm[124470]: starting task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam:
>>> Feb 13 14:00:16 pve9a1 qm[124472]: starting task UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>> Feb 13 14:00:16 pve9a1 qm[124474]: start VM 102: UPID:pve9a1:0001E63A:0011811E:698F2060:qmstart:102:root@pam:
>>> [...]
>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>> Feb 13 14:00:22 pve9a1 systemd[1]: 102.scope: Consumed 2min 3.333s CPU time, 2G memory peak.
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: Starting cleanup for 102
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: trying to acquire lock...
>>> Feb 13 14:00:23 pve9a1 qm[124470]: end task UPID:pve9a1:0001E639:001180FE:698F2060:qmreboot:102:root@pam: OK
>>> Feb 13 14:00:23 pve9a1 systemd[1]: Started 102.scope.
>>> Feb 13 14:00:23 pve9a1 qm[124474]: VM 102 started with PID 124620.
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: OK
>>> Feb 13 14:00:23 pve9a1 qmeventd[124565]: vm still running
>>
>> Currently, it's just indicating whether the cleanup handler should
>> start the VM again afterwards.
>>
>> On 13.02.26 at 1:22 PM, Dominik Csapak wrote:
>>> Sounds good, one possibility would be to do no cleanup at all when
>>> doing a stop mode backup?
>>> We already know we'll need the resources (pid/socket/etc. files,
>>> vgpus, ...) again?
>>>
>>> Or is there some situation where that might not be the case?
>>
>> We do it for reboot (if not another start task sneaks in like in my
>> example above), and I don't see a good reason off the top of my head
>> why 'stop' mode backup should behave differently from a reboot (for
>> running VMs). It even applies pending changes just like a reboot right
>> now.
>
> but what about external callers doing something like:
>
> - stop
> - do whatever
> - start
>
> in rapid (automated) succession? those would still (possibly) trigger
> cleanup after "doing whatever" and starting the VM again already? and in
> particular if we skip cleanup for "our" cases of stop;start it will be
> easy to introduce side effects in cleanup that break such usage?

I did not argue for skipping cleanup. I argued for being consistent with
reboot, where we (try to) do cleanup.
I just wasn't sure it's really needed.

>> I'm not sure if there is an actual need to do cleanup or if we could

I guess the actual need is to have more consistent behavior.

>> also skip it when we are planning to spin up another instance right
>> away. But we do it for reboot, so the "safe" variant is also doing it
>> for 'stop' mode backup. History tells me it's been there since the
>> reboot functionality was added:
>> https://lists.proxmox.com/pipermail/pve-devel/2019-September/038988.html
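
And to sketch what I understood the pending-cleanup flag proposal to mean
(completely untested, the file path and helper names below are made up for
illustration and don't exist yet):

    # hypothetical flag file, created by qmeventd before it starts the
    # cleanup and removed once the cleanup is done
    sub cleanup_pending_filename {
        my ($vmid) = @_;
        return "/var/run/qemu-server/$vmid.cleanup-pending";
    }

    # in the start flow: wait (bounded) for a pending cleanup to finish
    sub wait_for_pending_cleanup {
        my ($vmid, $timeout) = @_;
        $timeout //= 30;
        my $flagfile = cleanup_pending_filename($vmid);
        for (1 .. $timeout) {
            return if !-e $flagfile;
            sleep(1);
        }
        die "cleanup for VM $vmid still pending after $timeout seconds\n";
    }

If we go that route, the same wait could then also cover the reboot case I
asked about above.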