From: Benjamin McGuire <jaminmc@gmail.com>
To: Dominik Csapak <d.csapak@proxmox.com>
Cc: pve-devel@lists.proxmox.com
Subject: Re: [PATCH qemu-server v2] fix #7119: qm cleanup: wait for process exiting for up to 30 seconds
Date: Fri, 20 Feb 2026 22:11:47 -0500 [thread overview]
Message-ID: <A06B1824-27F4-4410-BA0A-D0FF6D6E5B21@gmail.com> (raw)
In-Reply-To: <c1670bdd-807b-46e7-92fe-e8ecc866eea7@proxmox.com>
I’ve come up with a different patch where the wait happens in the child process that runs qm cleanup (with help from Cursor).
Bug #7119:
When qmeventd notices a VM’s QMP socket closing, it immediately forks and runs qm cleanup. But the QEMU process might not have fully exited yet; for instance, USB passthrough can take a few seconds to tear down after the socket closes. This makes qm cleanup think the process is still running and bail out, skipping the post-stop hookscript and possibly leaving the cleanup incomplete.
To fix this, we can wait for the QEMU process to actually exit in the child process before running qm cleanup.
This works with pidfd + poll(), which:
* Returns right away if the process is already dead (the common case)
* Waits for up to 10 seconds if the teardown is slow (as with USB passthrough)
* Runs in the child process, so qmeventd’s event loop isn’t blocked or slowed down
For guest-initiated shutdowns (where terminate_client() wasn’t called and no pidfd is in the client struct), we open a new pidfd using pidfd_open(). If pidfd_open() fails (the process is already dead, or the kernel doesn’t support pidfd), we skip the wait and go ahead, the same as before this patch.
We set the client’s pidfd to -1 before removing it from the hash table to stop cleanup_client() from closing the fd twice.
This way, we avoid any Perl-side polling or sleeping while holding the VM config lock, which is how the v2 patch did it.
In my test, with my USB receiver passed through, it only needed to wait 1-2 seconds, so I made the timeout 10 seconds instead of 30. We could even make it 5, but slower machines may need more time for the USB teardown.
Feb 20 21:30:53 Froto qmeventd[3129101]: read: Connection reset by peer
This is where the QMP socket was closed by the guest-initiated shutdown.
Feb 20 21:30:55 Froto qmeventd[3135521]: Starting cleanup for 100
Two seconds later, the cleanup started.
Feb 20 21:30:56 Froto qmeventd[3135521]: [VM 100 GPU Recovery] HOOK CALLED – VM: 100 Phase: post-stop Time: 2026-02-20 21:30:56
The hookscript started one second later.
Feb 20 21:30:56 Froto qmeventd[3135521]: [VM 100 GPU Recovery] Starting GPU recovery (post-stop phase)
Feb 20 21:30:56 Froto qmeventd[3135521]: [VM 100 GPU Recovery] Detected PCI devices for recovery: 0000:03:00.0 0000:03:00.1 0000:12:00.6
Feb 20 21:30:56 Froto qmeventd[3135521]: [VM 100 GPU Recovery] Detected Vega 10 series GPU (video): 0000:03:00.0 — using light recovery (no remove/rescan)
Feb 20 21:30:56 Froto qmeventd[3135521]: [VM 100 GPU Recovery] → Attempt 1/12 (Vega light mode)
Feb 20 21:30:57 Froto qmeventd[3135521]: [VM 100 GPU Recovery] SUCCESS: 0000:03:00.0 recovered on attempt 1!
Feb 20 21:30:57 Froto qmeventd[3135521]: [VM 100 GPU Recovery] GPU recovery complete
Hook script complete
Feb 20 21:30:57 Froto qmeventd[3135521]: Finished cleanup for 100
Cleanup complete.
I also tested with no passthrough, and it still works as it should.
Feb 20 22:00:00 Froto qmeventd[3129101]: read: Connection reset by peer
Feb 20 22:00:00 Froto qmeventd[3166503]: Starting cleanup for 300
Feb 20 22:00:00 Froto qmeventd[3166503]: [VM 300 GPU Recovery] HOOK CALLED – VM: 300 Phase: post-stop Time: 2026-02-20 22:00:00
Feb 20 22:00:00 Froto qmeventd[3166503]: [VM 300 GPU Recovery] Starting GPU recovery (post-stop phase)
Feb 20 22:00:01 Froto qmeventd[3166503]: [VM 300 GPU Recovery] No PCI passthrough devices found.
Feb 20 22:00:01 Froto qmeventd[3166503]: Finished cleanup for 300
diff --git a/src/qmeventd/qmeventd.c b/src/qmeventd/qmeventd.c
index 1d9eb74a..d9d26586 100644
--- a/src/qmeventd/qmeventd.c
+++ b/src/qmeventd/qmeventd.c
@@ -31,6 +31,7 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include <poll.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
@@ -370,23 +371,41 @@ static void cleanup_qemu_client(struct Client *client) {
unsigned short guest = client->qemu.guest;
char vmid[sizeof(client->qemu.vmid)];
strncpy(vmid, client->qemu.vmid, sizeof(vmid));
+
+ int pidfd = client->pidfd;
+ client->pidfd = -1;
+
+ if (pidfd <= 0) {
+ pidfd = pidfd_open(client->pid, 0);
+ }
+
g_hash_table_remove(vm_clients, &vmid); // frees key, ignore errors
VERBOSE_PRINT("%s: executing cleanup (graceful: %d, guest: %d)\n", vmid, graceful, guest);
int pid = fork();
if (pid < 0) {
fprintf(stderr, "fork failed: %s\n", strerror(errno));
+ if (pidfd > 0) (void)close(pidfd);
return;
}
if (pid == 0) {
- char *script = "/usr/sbin/qm";
+ if (pidfd > 0) {
+ struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
+ if (poll(&pfd, 1, 10 * 1000) < 0) {
+ perror("poll on pidfd");
+ }
+ (void)close(pidfd);
+ }
+ char *script = "/usr/sbin/qm";
char *args[] = {script, "cleanup", vmid, graceful ? "1" : "0", guest ? "1" : "0", NULL};
-
execvp(script, args);
perror("execvp");
_exit(1);
}
+
+ // parent
+ if (pidfd > 0) (void)close(pidfd);
}
void cleanup_client(struct Client *client) {
> On Feb 20, 2026, at 9:51 AM, Dominik Csapak <d.csapak@proxmox.com> wrote:
>
>
>
> On 2/20/26 3:30 PM, Fiona Ebner wrote:
>> Am 20.02.26 um 10:36 AM schrieb Dominik Csapak:
>>> On 2/19/26 2:27 PM, Fiona Ebner wrote:
>>>> Am 19.02.26 um 11:15 AM schrieb Dominik Csapak:
>>>>> On 2/16/26 10:15 AM, Fiona Ebner wrote:
>>>>>> Am 16.02.26 um 9:42 AM schrieb Fabian Grünbichler:
>>>>>>> On February 13, 2026 2:16 pm, Fiona Ebner wrote:
>>>>>>
>>>>>> I guess the actual need is to have more consistent behavior.
>>>>>>
>>>>>
>>>>> ok so i think we'd need to
>>>>> * create a cleanup flag for each vm when qmevent detects a vm shutting
>>>>> down (in /var/run/qemu-server/VMID.cleanup, possibly with timestamp)
>>>>> * removing that cleanup flag after cleanup (obviously)
>>>>> * on start, check for that flag and block for some timeout before
>>>>> starting (e.g. check the timestamp in the flag if it's longer than some
>>>>> time, start it regardless?)
>>>>
>>>> Sounds good to me.
>>>>
The qmeventd.c patch makes this unnecessary, as the process has already exited.
>>>> Unfortunately, something else: turns out that we kinda rely on qmeventd
>>>> not doing the cleanup for the optimization with keeping the volumes
>>>> active (i.e. $keepActive). And actually, the optimization applies
>>>> randomly depending on who wins the race.
>>>>
>>>> Output below with added log line
>>>> "doing cleanup for $vmid with keepActive=$keepActive"
>>>> in vm_stop_cleanup() to be able to see what happens.
>>>>
>>>> We try to use the optimization but qmeventd interferes:
>>>>
>>>>> Feb 19 14:09:43 pve9a1 vzdump[168878]: <root@pam> starting task
>>>>> UPID:pve9a1:000293AF:0017CFF8:69970B97:vzdump:102:root@pam:
>>>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: starting new backup job:
>>>>> vzdump 102 --storage pbs --mode stop
>>>>> Feb 19 14:09:43 pve9a1 vzdump[168879]: INFO: Starting Backup of VM
>>>>> 102 (qemu)
>>>>> Feb 19 14:09:44 pve9a1 qm[168960]: shutdown VM 102:
>>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>>>> Feb 19 14:09:44 pve9a1 qm[168959]: <root@pam> starting task
>>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam:
>>>>> Feb 19 14:09:47 pve9a1 qm[168960]: VM 102 qga command failed - VM 102
>>>>> qga command 'guest-ping' failed - got timeout
>>>>> Feb 19 14:09:50 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>>>> Feb 19 14:09:50 pve9a1 pvedaemon[166884]: <root@pam> end task
>>>>> UPID:pve9a1:000290CD:0017B515:69970B52:vncproxy:102:root@pam: OK
>>>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>>>> Feb 19 14:09:50 pve9a1 systemd[1]: 102.scope: Consumed 41.780s CPU
>>>>> time, 1.9G memory peak.
>>>>> Feb 19 14:09:51 pve9a1 qm[168960]: doing cleanup for 102 with
>>>>> keepActive=1
>>>>> Feb 19 14:09:51 pve9a1 qm[168959]: <root@pam> end task
>>>>> UPID:pve9a1:00029400:0017D035:69970B98:qmshutdown:102:root@pam: OK
>>>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Starting cleanup for 102
>>>>> Feb 19 14:09:51 pve9a1 qm[168986]: doing cleanup for 102 with
>>>>> keepActive=0
>>>>> Feb 19 14:09:51 pve9a1 qmeventd[168986]: Finished cleanup for 102
>>>>> Feb 19 14:09:51 pve9a1 systemd[1]: Started 102.scope.
>>>>> Feb 19 14:09:51 pve9a1 vzdump[168879]: VM 102 started with PID 169021.
>>>>
>>>> We manage to get the optimization:
>>>>
>>>>> Feb 19 14:16:01 pve9a1 qm[174585]: shutdown VM 102:
>>>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam:
>>>>> Feb 19 14:16:04 pve9a1 qm[174585]: VM 102 qga command failed - VM 102
>>>>> qga command 'guest-ping' failed - got timeout
>>>>> Feb 19 14:16:07 pve9a1 qmeventd[166736]: read: Connection reset by peer
>>>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Deactivated successfully.
>>>>> Feb 19 14:16:07 pve9a1 systemd[1]: 102.scope: Consumed 46.363s CPU
>>>>> time, 2G memory peak.
>>>>> Feb 19 14:16:08 pve9a1 qm[174585]: doing cleanup for 102 with
>>>>> keepActive=1
>>>>> Feb 19 14:16:08 pve9a1 qm[174582]: <root@pam> end task
>>>>> UPID:pve9a1:0002A9F9:0018636B:69970D11:qmshutdown:102:root@pam: OK
>>>>> Feb 19 14:16:08 pve9a1 systemd[1]: Started 102.scope.
>>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: Starting cleanup for 102
>>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: trying to acquire lock...
>>>>> Feb 19 14:16:08 pve9a1 vzdump[174326]: VM 102 started with PID 174718.
>>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: OK
>>>>> Feb 19 14:16:08 pve9a1 qmeventd[174685]: vm still running
>>>>
>>>> For regular shutdown, we'll also do the cleanup twice.
>>>>
>>>> Maybe we also need a way to tell qmeventd that we already did the
>>>> cleanup?
>>>
>>>
>>> ok well then i'd try to do something like this:
>>>
>>> in
>>>
>>> 'vm_stop' we'll create a cleanup flag with timestamp + state (e.g.
>>> 'queued')
>>>
>>> in vm_stop_cleanup we change/create the flag with
>>> 'started' and clear the flag after cleanup
>> Why is the one in vm_stop needed? Is there any advantage over creating
>> it directly in vm_stop_cleanup()?
>
> after a bit of experimenting and re-reading the code, i think
> I can simplify the logic
>
> at the beginning of vm_stop, we create the cleanup flag
> in 'qm cleanup', we only do the cleanup if the flag does not exist
> in 'vm_start' we clean the flag
>
> this should work because these parts are under a config lock anyway:
> * from vm_stop to vm_stop_cleanup
> * most of the qm cleanup code
> * vm_start
>
> so we only really have to mark that the cleanup was done from
> the vm_stop codepath
>
> (we have to create the flag at the beginning of vm_stop, because
> then there is no race between calling it's cleanup and qmeventd
> picking up the vanishing process)
>
> does that make sense to you?
By patching qmeventd.c to wait for the process to exit in a child process, the QEMU process is already gone by the time qm cleanup is run, and the shutdown behaves normally, just as it does with the current code when no USB device is passed through.
>
>>> (if it's here already in 'started' state within a timelimit, ignore it)
>>>
>>> in vm_start we block until the cleanup flag is gone or until some timeout
>>>
>>> in 'qm cleanup' we only start it if the flag does not exist
>> Hmm, it does also call vm_stop_cleanup() so we could just re-use the
>> check there for that part? I guess doing an early check doesn't hurt
>> either, as long as we do call the post-stop hook.
>>> I think this should make the behavior consistent?
Thread overview: 17+ messages
2026-02-10 11:15 Dominik Csapak
2026-02-12 20:33 ` Benjamin McGuire
2026-02-13 11:40 ` Fabian Grünbichler
2026-02-13 12:14 ` Fiona Ebner
2026-02-13 12:20 ` Fabian Grünbichler
2026-02-13 13:16 ` Fiona Ebner
2026-02-16 8:42 ` Fabian Grünbichler
2026-02-16 9:15 ` Fiona Ebner
2026-02-19 10:15 ` Dominik Csapak
2026-02-19 13:27 ` Fiona Ebner
2026-02-20 9:36 ` Dominik Csapak
2026-02-20 14:30 ` Fiona Ebner
2026-02-20 14:51 ` Dominik Csapak
2026-02-21 3:11 ` Benjamin McGuire [this message]
2026-02-23 9:48 ` Fiona Ebner
2026-02-23 10:27 ` Fiona Ebner
2026-02-13 12:22 ` Dominik Csapak