From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id DABF41FF13F for ; Thu, 26 Feb 2026 11:33:29 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 4FE6B1BCD0; Thu, 26 Feb 2026 11:34:26 +0100 (CET) From: Hannes Laimer To: pve-devel@lists.proxmox.com Subject: [PATCH pve-common] REST environment: `waitpid` directly on SIGCHLD Date: Thu, 26 Feb 2026 11:33:48 +0100 Message-ID: <20260226103348.37202-1-h.laimer@proxmox.com> X-Mailer: git-send-email 2.47.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1772102013644 X-SPAM-LEVEL: Spam detection results: 0 AWL -1.004 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_VALIDITY_CERTIFIED_BLOCKED 0.618 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_RPBL_BLOCKED 0.734 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_SAFE_BLOCKED 0.78 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: B2IXHFRPD475HMT2JCVDNVSOJLNBUQF7 X-Message-ID-Hash: B2IXHFRPD475HMT2JCVDNVSOJLNBUQF7 X-MailFrom: h.laimer@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: If we are in an API server that uses AnyEvent, we currently postpone the task cleanup because the `active_workers` call may lead to a problem described in [1]. This, however, is only relevant to the `log_task_result` part of the `worker_reaper` sub, since that calls `active_workers`. The problem with postponing `waitpid` is that our task status endpoint checks the proc file for the PID, and without the `waitpid` call, the process sticks around as a zombie (`Z`). This leads to the status being reported as 'running' even though it is not. If the API is called rapidly, this postponement is pushed back potentially indefinitely. This is generally problematic, but specifically for a use-case where a task is started and the `/status` endpoint is polled to check for task completion. In that case, if the polling is fast enough, it prevents the cleanup and therefore the correct status reporting. I could reproduce this pretty consistently with: ```perl my $upid = self->client()->put('/some/task/starting/endpoint', {}); do { $res = $self->client()->get("/nodes/localhost/tasks/$upid/status"); # sleep 1 } while ($res->{status} eq "running"); ``` Adding a (long enough) sleep prevented the problem. With `0.5`, the task was correctly reported as done about 30-60s after it actually finished; with `0.1`, it took a couple of minutes. Without a `sleep`, tasks would never (>5-10 mins) be reported as done. [1] 6870afa4 ("RESTEnvironment: better SIGCHLD handling in AnyEvent event loop") Signed-off-by: Hannes Laimer --- src/PVE/RESTEnvironment.pm | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm index 81d7e29..4ed5c05 100644 --- a/src/PVE/RESTEnvironment.pm +++ b/src/PVE/RESTEnvironment.pm @@ -78,7 +78,16 @@ my $worker_reaper = sub { if (defined($waitpid) && ($waitpid == $pid)) { my $info = $WORKER_PIDS->{$pid}; if ($info && $info->{upid} && $info->{user}) { - &$log_task_result($info->{upid}, $info->{user}, $?); + # when we're using AnyEvent, we have to postpone the call to log_task_result, otherwise it + # might interfere with running api calls + my $exit_code = $?; + if (defined($AnyEvent::MODEL)) { + AnyEvent::postpone { + $log_task_result->($info->{upid}, $info->{user}, $exit_code) + }; + } else { + $log_task_result->($info->{upid}, $info->{user}, $exit_code); + } } delete($WORKER_PIDS->{$pid}); } @@ -115,13 +124,7 @@ sub init { if !$type || $type !~ m/^(cli|pub|priv|ha)$/; $SIG{CHLD} = sub { - # when we're using AnyEvent, we have to postpone the call to worker_reaper, otherwise it - # might interfere with running api calls - if (defined($AnyEvent::MODEL)) { - AnyEvent::postpone { $worker_reaper->() }; - } else { - $worker_reaper->(); - } + $worker_reaper->(); }; # environment types -- 2.47.3