From: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Subject: Re: [PATCH pve-common] RESTEnvironment: fix possible race in `register_worker`
To: Hannes Laimer, pve-devel@lists.proxmox.com
Date: Wed, 04 Mar 2026 10:57:23 +0100
Message-Id: <1772617908.i4bmsyq0kp.astroid@yuna.none>
In-Reply-To: <88a31fde-949b-4e3c-9561-f68c6cedc28b@proxmox.com>
References: <20260303071526.1150-1-h.laimer@proxmox.com> <1772525404.stk1gnguby.astroid@yuna.none> <88a31fde-949b-4e3c-9561-f68c6cedc28b@proxmox.com>
List-Id: Proxmox VE development discussion
On March 3, 2026 9:37 am, Hannes Laimer wrote:
> On 2026-03-03 09:24, Fabian Grünbichler wrote:
>> On March 3, 2026 8:15 am, Hannes Laimer wrote:
>>> If the worker finishes right after we `waitpid` but before we add it to
>>> `WORKER_PIDS`, the `worker_reaper` won't `waitpid` it, because it only
>>> iterates over `WORKER_PIDS`. So
>>
>> it would be interesting to get more details on how this happens in
>> practice (with your reproducer)?
>>
>
> I do have a reproducer for task processes sticking around as zombies
> when they are done, but this change unfortunately did not fix that. I
> just noticed this in the process of finding the cause of the "original"
> problem, so I guess this is not a problem in practice, because of the
> tight timings? But technically it would be possible (I think)

so we figured that one out in the meantime.. and it is probably best to
fix that issue by revamping the worker tracking entirely, both to fix
the bug and to improve performance/reduce overhead.

I still think we want to improve the error handling during forking, and
that this patch here doesn't actually fix anything substantial other
than temporary zombies if the worker terminates during setup. it
shouldn't hurt either though..

>> the sequence when forking a worker is:
>> - fork
>> - child executes some setup code
>> - child tells parent it is ready
>> - child waits for parent to tell it it can continue
>>
>> register_worker is called by the parent in between the last two steps
>> (after receiving the notification from the child, but before sending the
>> notification to the child), so why does the child disappear in between?
>>
>> I think this might actually (also?) be missing error handling in
>> fork_worker? all the POSIX::close/read/write calls there don't check for
>> failure, which means we attempt to register a worker that has already
>> failed at that point?
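The handshake sequence described above can be sketched as follows (a Python stand-in for the actual Perl in pve-common; the pipe names and structure here are illustrative, not the real `fork_worker` code):

```python
import os

def fork_worker_sketch():
    # two handshake pipes (hypothetical, mirroring the described protocol)
    r_ready, w_ready = os.pipe()  # child -> parent: "setup done, I am ready"
    r_go, w_go = os.pipe()        # parent -> child: "you may continue"

    pid = os.fork()
    if pid == 0:
        # child: run setup, signal readiness, then block for the go-ahead
        os.close(r_ready)
        os.close(w_go)
        os.write(w_ready, b"ready")
        os.read(r_go, 1)          # blocks until the parent acks
        os._exit(0)

    # parent
    os.close(w_ready)
    os.close(r_go)
    msg = os.read(r_ready, 5)     # receive the readiness notification
    # <-- register_worker() happens here in the real code; if the child
    #     already failed, unchecked read/write return values would let the
    #     parent register a worker that is effectively gone
    os.write(w_go, b"g")          # tell the child to continue
    _, status = os.waitpid(pid, 0)
    return msg, status

msg, status = fork_worker_sketch()
```

Note that an empty `os.read` result in the parent (EOF because the child died before writing) is exactly the kind of failure the unchecked `POSIX::read` calls would silently swallow.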
>>
>
> could be, but I don't think that should influence whether a `SIGCHLD` is
> sent when the child is done? Because the handler for `SIGCHLD` in the
> parent is never called...
> I'll take a look at that, thanks for the pointer!
>
>> and, somewhat tangentially related - should we switch this code over to
>> use pidfds and waitid to close PID reuse races?
>>
>
> @Wolfgang also mentioned that, would probably make sense