From: Hannes Laimer
To: Fabian Grünbichler, pve-devel@lists.proxmox.com
Subject: Re: [PATCH pve-common] RESTEnvironment: fix possible race in `register_worker`
Date: Tue, 3 Mar 2026 09:37:41 +0100
Message-ID: <88a31fde-949b-4e3c-9561-f68c6cedc28b@proxmox.com>
In-Reply-To: <1772525404.stk1gnguby.astroid@yuna.none>
References: <20260303071526.1150-1-h.laimer@proxmox.com> <1772525404.stk1gnguby.astroid@yuna.none>
List-Id: Proxmox VE development discussion

On 2026-03-03 09:24, Fabian Grünbichler wrote:
> On March 3, 2026 8:15 am, Hannes Laimer wrote:
>> If the worker finishes right after we `waitpid` but before we add it to
>> `WORKER_PIDS` the `worker_reaper` won't `waitpid` it cause it iterates
>> over `WORKER_PIDS`. So
>
> it would be interesting to get more details how this happens in practice
> (with your reproducer)?
>

I do have a reproducer for task processes sticking around as zombies
when they are done, but this change unfortunately did not fix that. I
only noticed this while looking for the cause of the "original" problem,
so I guess it is not a problem in practice because of the tight timings?
But technically it would be possible (I think).

> the sequence when forking a worker is:
> - fork
> - child executes some setup code
> - child tells parent it is ready
> - child waits for parent to tell it that it can continue
>
> register_worker is called by the parent in between the last two steps
> (after receiving the notification from the child, but before sending the
> notification to the child), so why does the child disappear in between?
>
> I think this might actually (also?) be missing error handling in
> fork_worker? all the POSIX::close/read/write calls there don't check for
> failure, which means we attempt to register a worker that has already
> failed at that point?
>

could be, but I don't think that should influence whether a `SIGCHLD` is
sent when the child is done? Because the handler for `SIGCHLD` in the
parent is never called... I'll take a look at that, thanks for the
pointer!

> and, somewhat tangentially related - should we switch this code over to
> use pidfds and waitid to close PID reuse races?
>

@Wolfgang also mentioned that, would probably make sense

>> - the clean-up triggered by the SIGCHLD won't catch it cause it needs it to
>>   be in `WORKER_PIDS`
>> - and, `register_worker` won't because it was still running when it
>>   `waitpid`'ed it
>>
>> Moving the insertion into `WORKER_PIDS` before the `waitpid` solves
>> this by making sure it is
>> - always in the var for `worker_reaper`
>> - and, if SIGCHLD should trigger `worker_reaper` before we add it to
>>   `WORKER_PIDS`, the `waitpid` in `register_worker` itself will catch
>>   it
>>
>> Signed-off-by: Hannes Laimer
>> ---
>>  src/PVE/RESTEnvironment.pm | 11 ++++++-----
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
>> index 4ed5c05..4677687 100644
>> --- a/src/PVE/RESTEnvironment.pm
>> +++ b/src/PVE/RESTEnvironment.pm
>> @@ -99,17 +99,18 @@ my $register_worker = sub {
>>
>>      return if !$pid;
>>
>> -    # do not register if already finished
>> +    $WORKER_PIDS->{$pid} = {
>> +        user => $user,
>> +        upid => $upid,
>> +    };
>> +
>> +    # remove immediately if already finished
>>      my $waitpid = waitpid($pid, WNOHANG);
>>      if (defined($waitpid) && ($waitpid == $pid)) {
>>          delete($WORKER_PIDS->{$pid});
>>          return;
>>      }
>>
>> -    $WORKER_PIDS->{$pid} = {
>> -        user => $user,
>> -        upid => $upid,
>> -    };
>>  };
>>
>>  # initialize environment - must be called once at program startup
>> --
>> 2.47.3
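
For reference, here is a minimal standalone sketch of the ordering the patch proposes: insert into the table first, then do the non-blocking `waitpid`. It is not the code from `RESTEnvironment.pm`. The `$WORKER_PIDS` hash, `worker_reaper`, and `register_worker` below are simplified stand-ins, and `spawn` plus the user/UPID strings are hypothetical helpers added only for the demo. The reaper is called by hand rather than from a `SIGCHLD` handler to keep the example deterministic.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

$| = 1;    # unbuffered output, so nothing is duplicated across fork()

# Simplified stand-in for the module-level worker table.
my $WORKER_PIDS = {};

# Roughly what a SIGCHLD-triggered reaper does: it only checks PIDs
# that are already present in the table.
sub worker_reaper {
    for my $pid (keys %$WORKER_PIDS) {
        my $res = waitpid($pid, WNOHANG);
        delete $WORKER_PIDS->{$pid} if defined($res) && $res == $pid;
    }
}

# Register first, then check once whether the worker has already exited.
sub register_worker {
    my ($pid, $user, $upid) = @_;
    return if !$pid;

    $WORKER_PIDS->{$pid} = { user => $user, upid => $upid };

    # remove immediately if already finished
    my $res = waitpid($pid, WNOHANG);
    delete $WORKER_PIDS->{$pid} if defined($res) && $res == $pid;
    return;
}

# Hypothetical demo helper: fork a child that runs $code and then exits.
sub spawn {
    my ($code) = @_;
    my $pid = fork();
    die "fork failed: $!\n" if !defined($pid);
    if ($pid == 0) {
        $code->();
        POSIX::_exit(0);    # skip END blocks and stdio flushing in the child
    }
    return $pid;
}

# Case 1: the worker exits before it gets registered.
my $fast = spawn(sub { });
select(undef, undef, undef, 0.5);    # give the child time to exit
register_worker($fast, 'root@pam', 'UPID:fast');
print "fast worker still registered: ",
    (exists $WORKER_PIDS->{$fast} ? "yes" : "no"), "\n";

# Case 2: the worker is still running at registration time and is
# collected later by the reaper.
my $slow = spawn(sub { sleep(30) });
register_worker($slow, 'root@pam', 'UPID:slow');
print "slow worker registered: ",
    (exists $WORKER_PIDS->{$slow} ? "yes" : "no"), "\n";
kill('TERM', $slow);
select(undef, undef, undef, 0.5);
worker_reaper();
print "slow worker after reaper: ",
    (exists $WORKER_PIDS->{$slow} ? "yes" : "no"), "\n";
```

With this ordering there is no window in which a finished worker is in neither place. It is either caught by the `waitpid` in `register_worker`, or it is already in the table for the next reaper run.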