From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) by lore.proxmox.com (Postfix) with ESMTPS id 416711FF13F for ; Thu, 12 Mar 2026 10:49:14 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 3ED31C998; Thu, 12 Mar 2026 10:49:10 +0100 (CET) Date: Thu, 12 Mar 2026 10:48:33 +0100 From: Fabian =?iso-8859-1?q?Gr=FCnbichler?= Subject: Re: [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback To: Hannes Laimer , pve-devel@lists.proxmox.com References: <20260304134649.82272-1-h.laimer@proxmox.com> <20260304134649.82272-2-h.laimer@proxmox.com> In-Reply-To: <20260304134649.82272-2-h.laimer@proxmox.com> MIME-Version: 1.0 User-Agent: astroid/0.17.0 (https://github.com/astroidmail/astroid) Message-Id: <1773305738.33dnx84kzu.astroid@yuna.none> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1773308881123 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.054 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_MSPIKE_H2 0.001 Average reputation (+2) SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: IQT5J6JW3AN46VZGMSK4TOOYZYOAAERI X-Message-ID-Hash: IQT5J6JW3AN46VZGMSK4TOOYZYOAAERI X-MailFrom: f.gruenbichler@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Wolfgang Bumiller X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On March 4, 2026 2:46 pm, Hannes Laimer wrote: > Libraries may temporarily override $SIG{CHLD}, causing worker exit > signals to be lost. Poll worker_reaper every 5 seconds via an AnyEvent > timer to catch any missed signals in API server contexts where this can > be problematic. >=20 > Signed-off-by: Hannes Laimer > --- > src/PVE/RESTEnvironment.pm | 9 +++++++++ > 1 file changed, 9 insertions(+) >=20 > diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm > index 81d7e29..cb44823 100644 > --- a/src/PVE/RESTEnvironment.pm > +++ b/src/PVE/RESTEnvironment.pm > @@ -37,6 +37,7 @@ my $rest_env; > =20 > my $WORKER_PIDS; > my $WORKER_FLAG =3D 0; > +my $worker_reaper_timer; > =20 > my $log_task_result =3D sub { > my ($upid, $user, $status) =3D @_; > @@ -124,6 +125,14 @@ sub init { > } > }; > =20 > + # Periodically reap workers as a fallback in case a library call tem= porarily overrides > + # $SIG{CHLD} and causes us to miss a signal. Only useful when an eve= nt loop is running. I'd replace the "Only useful" with something like "Only runs", makes it mor= e explicit. AFAICT, this not only affects SIGCHLD being redirected, but also SIGCHLD triggering while we are processing a previous SIGCHLD (if we checked that particular PID already while iterating, but before it exited). we could maybe consider to change worker_reaper to do my $dead; do { $dead =3D waitpid(-1, WNOHANG)) if (my $info =3D $WORKER_PIDS->{$dead}) { # log and delete } } while $dead > 0; that way, it should be faster/cheaper and handle other (transitive, non-worker) zombies that might have been missed by whatever forked them? we do have quite a few places where we explicitly wait for a specific child though, that need to be checked because they might break as a result.. so in the end the best way to make this better is maybe to use pidfds via AnyEvent if we have an event loop, to get notified and handle workers exiting without relying on SIGCHLD at all.. for CLI usage, we already waitpid the forked worker directly anyway (so should not have a zombie problem, and that flow could switch to a pidfd as well).. there's two fork_worker calls that set the background flag, but I don't even see how they'd be invoked in a CLI environment.. > + $worker_reaper_timer =3D AnyEvent->timer( > + after =3D> 5, > + interval =3D> 5, > + cb =3D> sub { $worker_reaper->() }, worker_reaper isn't that cheap atm, I don't think we want to run it every 5s.. on a busy server with lots of tasks, this will do a syscall per registered task worker.. since this is "just" the fallback for the rare case of "somehow we missed our signal", doing it once every minute or so should be enough? I think we can apply this with the timer interval set to a more sensible value, and the comments slightly adapted.. > + ); > + > # environment types > # cli ... command started fron command line > # pub ... access from public server (pveproxy) > --=20 > 2.47.3 >=20 >=20 >=20 >=20 >=20 >=20