Subject: Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
From: Dominik Csapak <d.csapak@proxmox.com>
To: Fiona Ebner, pve-devel@lists.proxmox.com
Date: Fri, 13 Feb 2026 13:22:19 +0100
Message-ID: <5246d854-03cf-4fe2-9f01-5dffa69aa96b@proxmox.com>
In-Reply-To: <7ee8d206-36fd-4ade-893b-c7c2222a8883@proxmox.com>
References: <20260210111612.2017883-1-d.csapak@proxmox.com> <7ee8d206-36fd-4ade-893b-c7c2222a8883@proxmox.com>

On 2/13/26 1:14 PM, Fiona Ebner wrote:
> On 10.02.26 at 12:14 PM, Dominik Csapak wrote:
>> When qmeventd detects a VM exiting, it starts 'qm cleanup' to clean up
>> files, execute hookscripts, etc.
>>
>> Since the VM process exit is sometimes not instant, wait here for up
>> to 30 seconds before starting the cleanup instead of immediately
>> aborting if the pid still exists. Aborting immediately prevented the
>> hookscript from being executed for the 'post-stop' phase.
>>
>> This can easily be reproduced by e.g. passing through a USB device,
>> which delays the QEMU process exit for a few seconds.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> changes from v1:
>> * use correct while condition (time() is always >= $starttime)
>>
>> original comment:
>>
>> The 30 second timeout was arbitrarily chosen, but we could probably
>> start with something smaller, like 10 seconds? Could be adapted on
>> applying though.
>>
>> In my (short) tests the USB passthrough part only adds a single second,
>> but I can imagine different devices on other systems could block it for
>> much longer.
>>
>>  src/PVE/CLI/qm.pm | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
>> index bdae9641..16875ed2 100755
>> --- a/src/PVE/CLI/qm.pm
>> +++ b/src/PVE/CLI/qm.pm
>> @@ -1101,8 +1101,19 @@ __PACKAGE__->register_method({
>>              60,
>>              sub {
>>                  my $conf = PVE::QemuConfig->load_config($vmid);
>> +
>> +                # wait for some timeout until vm process exits, since this might not be instant
>
> s/timeout/time/
>
> Nit: s/vm/the QEMU/
>
> Maybe add "after the QMP 'SHUTDOWN' event"?
>
>> +                my $timeout = 30;
>> +                my $starttime = time();
>>                  my $pid = PVE::QemuServer::check_running($vmid);
>> -                die "vm still running\n" if $pid;
>> +                warn "vm still running - waiting up to $timeout seconds\n" if $pid;
>
> While we're at it, we could improve the message here. Something like
> 'QEMU process $pid for VM $vmid still running (or newly started)'.
> Having the PID is nice info for developers/support engineers, and the
> case where a new instance is started before the cleanup was done is
> also possible.
>
> In fact, the case with the new instance is easily triggered by 'stop'
> mode backups. Maybe we should fix that up first before adding a timeout
> here?
>
> Feb 13 13:09:48 pve9a1 qm[92975]: end task UPID:pve9a1:00016B30:000CDF80:698F1485:qmshutdown:102:root@pam: OK
> Feb 13 13:09:48 pve9a1 systemd[1]: Started 102.scope.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: Starting cleanup for 102
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: trying to acquire lock...
> Feb 13 13:09:48 pve9a1 vzdump[92895]: VM 102 started with PID 93116.
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: OK
> Feb 13 13:09:48 pve9a1 qmeventd[93079]: vm still running
>

Sounds good. One possibility would be to do no cleanup at all when doing
a stop mode backup? We already know we'll need the resources
(pid/socket/etc. files, vGPUs, ...) again. Or is there some situation
where that might not be the case?

>
>> +
>> +                while ($pid && (time() - $starttime) < $timeout) {
>> +                    sleep(1);
>> +                    $pid = PVE::QemuServer::check_running($vmid);
>> +                }
>> +
>> +                die "vm still running - aborting cleanup\n" if $pid;
>>
>>                  # Rollback already does cleanup when preparing and afterwards temporarily drops the
>>                  # lock on the configuration file to rollback the volumes. Deactivating volumes here
>
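
For a potential v3, the waiting part with your comment and message
suggestions folded in could look roughly like this (just a sketch, exact
wording and timeout value are of course still open, and the stop-mode
backup case would still need to be handled separately):

                # wait for some time until the QEMU process exits after
                # the QMP 'SHUTDOWN' event, since this might not be instant
                my $timeout = 30;
                my $starttime = time();

                my $pid = PVE::QemuServer::check_running($vmid);
                warn "QEMU process $pid for VM $vmid still running (or newly started)"
                    . " - waiting up to $timeout seconds\n"
                    if $pid;

                while ($pid && (time() - $starttime) < $timeout) {
                    sleep(1);
                    $pid = PVE::QemuServer::check_running($vmid);
                }

                die "QEMU process for VM $vmid still running - aborting cleanup\n" if $pid;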