public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten
@ 2026-03-04 13:46 Hannes Laimer
  2026-03-04 13:46 ` [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback Hannes Laimer
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Hannes Laimer @ 2026-03-04 13:46 UTC (permalink / raw)
  To: pve-devel

Thanks a lot @Fabian and @Fiona for helping me debug this!

The problem is that some libaries do overwrite the SIGCHLD handler
temporarily, if the library is called fast enough this can lead to lost
CHLD signals which in turn prevents `worker_reaper` from being called in
RESTEnvironment. So tasks won't get cleaned-up until a different SIGCHLD
arrives at the same `pvedeamon` process triggering `worker_reaper`.

As @Fabian mentioned in [1] a general re-work of the task handling,
potentially with `pidfd`s, would make a lot of sense.

These two patches address the problem in the task handling structure as
it currently is. They
 - run the PAM lib call in a fork, so signal handler changes the library
   does are isloated from our process
 - run `worker_reaper` periodically (5s) do catch any other potential
   instances of this, since it would be possible that the same happens
   with other libs, not just PAM

[1] https://lore.proxmox.com/pve-devel/1772617908.i4bmsyq0kp.astroid@yuna.none/T/#m7b0f3873be5755f330e288cfa50905744f225b2b


pve-common:

Hannes Laimer (1):
  RESTEnvironment: periodically reap workers as SIGCHLD fallback

 src/PVE/RESTEnvironment.pm | 9 +++++++++
 1 file changed, 9 insertions(+)


pve-access-control:

Hannes Laimer (1):
  pam: fork for PAM authentication to isolate SIGCHLD handler

 src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 32 deletions(-)


Summary over all repositories:
  2 files changed, 51 insertions(+), 32 deletions(-)

-- 
Generated by murpp 0.9.0




^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback
  2026-03-04 13:46 [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Hannes Laimer
@ 2026-03-04 13:46 ` Hannes Laimer
  2026-03-12  9:48   ` Fabian Grünbichler
  2026-03-04 13:46 ` [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler Hannes Laimer
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Hannes Laimer @ 2026-03-04 13:46 UTC (permalink / raw)
  To: pve-devel

Libraries may temporarily override $SIG{CHLD}, causing worker exit
signals to be lost. Poll worker_reaper every 5 seconds via an AnyEvent
timer to catch any missed signals in API server contexts where this can
be problematic.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
 src/PVE/RESTEnvironment.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
index 81d7e29..cb44823 100644
--- a/src/PVE/RESTEnvironment.pm
+++ b/src/PVE/RESTEnvironment.pm
@@ -37,6 +37,7 @@ my $rest_env;
 
 my $WORKER_PIDS;
 my $WORKER_FLAG = 0;
+my $worker_reaper_timer;
 
 my $log_task_result = sub {
     my ($upid, $user, $status) = @_;
@@ -124,6 +125,14 @@ sub init {
         }
     };
 
+    # Periodically reap workers as a fallback in case a library call temporarily overrides
+    # $SIG{CHLD} and causes us to miss a signal. Only useful when an event loop is running.
+    $worker_reaper_timer = AnyEvent->timer(
+        after => 5,
+        interval => 5,
+        cb => sub { $worker_reaper->() },
+    );
+
     # environment types
     # cli  ... command started fron command line
     # pub  ... access from public server (pveproxy)
-- 
2.47.3





^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler
  2026-03-04 13:46 [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Hannes Laimer
  2026-03-04 13:46 ` [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback Hannes Laimer
@ 2026-03-04 13:46 ` Hannes Laimer
  2026-03-12  9:44   ` Fabian Grünbichler
  2026-03-06 17:16 ` [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Stefan Hanreich
  2026-03-11 14:56 ` Michael Köppl
  3 siblings, 1 reply; 7+ messages in thread
From: Hannes Laimer @ 2026-03-04 13:46 UTC (permalink / raw)
  To: pve-devel

PAM modules can temporarily override $SIG{CHLD}, causing SIGCHLDs from
RESTEnvironment worker processes to be lost. Run the PAM interaction in
a subprocess via PVE::Tools::run_fork to contain any signal handler
manipulation to the child.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
 src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/src/PVE/Auth/PAM.pm b/src/PVE/Auth/PAM.pm
index 3aacfc0..8586da5 100755
--- a/src/PVE/Auth/PAM.pm
+++ b/src/PVE/Auth/PAM.pm
@@ -27,45 +27,55 @@ sub authenticate_user {
     # user (www-data) need to be able to read /etc/passwd /etc/shadow
     die "no password\n" if !$password;
 
-    my $pamh = Authen::PAM->new(
-        'proxmox-ve-auth',
-        $username,
-        sub {
-            my @res;
-            while (@_) {
-                my $msg_type = shift;
-                my $msg = shift;
-                push @res, (0, $password);
-            }
-            push @res, 0;
-            return @res;
-        },
-    );
-
-    if (!ref($pamh)) {
-        my $err = $pamh->pam_strerror($pamh);
-        die "error during PAM init: $err";
+    # PAM modules may temporarily override $SIG{CHLD}, causing SIGCHLDs from
+    # RESTEnvironment workers to be lost. Running the PAM interaction in a fork
+    # isolates any such handler manipulation from the parent process.
+    my $client_ip;
+    if (my $rpcenv = PVE::RPCEnvironment::get()) {
+        $client_ip = $rpcenv->get_client_ip();
     }
 
-    if (my $rpcenv = PVE::RPCEnvironment::get()) {
-        if (my $ip = $rpcenv->get_client_ip()) {
-            $pamh->pam_set_item(PAM_RHOST(), $ip);
+    PVE::Tools::run_fork(sub {
+        my $pamh = Authen::PAM->new(
+            'proxmox-ve-auth',
+            $username,
+            sub {
+                my @res;
+                while (@_) {
+                    my $msg_type = shift;
+                    my $msg = shift;
+                    push @res, (0, $password);
+                }
+                push @res, 0;
+                return @res;
+            },
+        );
+
+        if (!ref($pamh)) {
+            my $err = $pamh->pam_strerror($pamh);
+            die "error during PAM init: $err";
         }
-    }
 
-    my $res;
+        if ($client_ip) {
+            $pamh->pam_set_item(PAM_RHOST(), $client_ip);
+        }
 
-    if (($res = $pamh->pam_authenticate(0)) != PAM_SUCCESS) {
-        my $err = $pamh->pam_strerror($res);
-        die "$err\n";
-    }
+        my $res;
 
-    if (($res = $pamh->pam_acct_mgmt(0)) != PAM_SUCCESS) {
-        my $err = $pamh->pam_strerror($res);
-        die "$err\n";
-    }
+        if (($res = $pamh->pam_authenticate(0)) != PAM_SUCCESS) {
+            my $err = $pamh->pam_strerror($res);
+            die "$err\n";
+        }
+
+        if (($res = $pamh->pam_acct_mgmt(0)) != PAM_SUCCESS) {
+            my $err = $pamh->pam_strerror($res);
+            die "$err\n";
+        }
+
+        $pamh = 0; # call destructor
 
-    $pamh = 0; # call destructor
+        return 1;
+    });
 
     return 1;
 }
-- 
2.47.3





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten
  2026-03-04 13:46 [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Hannes Laimer
  2026-03-04 13:46 ` [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback Hannes Laimer
  2026-03-04 13:46 ` [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler Hannes Laimer
@ 2026-03-06 17:16 ` Stefan Hanreich
  2026-03-11 14:56 ` Michael Köppl
  3 siblings, 0 replies; 7+ messages in thread
From: Stefan Hanreich @ 2026-03-06 17:16 UTC (permalink / raw)
  To: Hannes Laimer, pve-devel

Applied this today while developing my integration tests and haven't
encountered issues w.r.t tasks hanging since then on my test instances.

Consider this:
Tested-by: Stefan Hanreich <s.hanreich@proxmox.com>

On 3/4/26 2:47 PM, Hannes Laimer wrote:
> Thanks a lot @Fabian and @Fiona for helping me debug this!
> 
> The problem is that some libaries do overwrite the SIGCHLD handler
> temporarily, if the library is called fast enough this can lead to lost
> CHLD signals which in turn prevents `worker_reaper` from being called in
> RESTEnvironment. So tasks won't get cleaned-up until a different SIGCHLD
> arrives at the same `pvedeamon` process triggering `worker_reaper`.
> 
> As @Fabian mentioned in [1] a general re-work of the task handling,
> potentially with `pidfd`s, would make a lot of sense.
> 
> These two patches address the problem in the task handling structure as
> it currently is. They
>  - run the PAM lib call in a fork, so signal handler changes the library
>    does are isloated from our process
>  - run `worker_reaper` periodically (5s) do catch any other potential
>    instances of this, since it would be possible that the same happens
>    with other libs, not just PAM
> 
> [1] https://lore.proxmox.com/pve-devel/1772617908.i4bmsyq0kp.astroid@yuna.none/T/#m7b0f3873be5755f330e288cfa50905744f225b2b
> 
> 
> pve-common:
> 
> Hannes Laimer (1):
>   RESTEnvironment: periodically reap workers as SIGCHLD fallback
> 
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> 
> pve-access-control:
> 
> Hannes Laimer (1):
>   pam: fork for PAM authentication to isolate SIGCHLD handler
> 
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
> 
> 
> Summary over all repositories:
>   2 files changed, 51 insertions(+), 32 deletions(-)
> 





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten
  2026-03-04 13:46 [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Hannes Laimer
                   ` (2 preceding siblings ...)
  2026-03-06 17:16 ` [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Stefan Hanreich
@ 2026-03-11 14:56 ` Michael Köppl
  3 siblings, 0 replies; 7+ messages in thread
From: Michael Köppl @ 2026-03-11 14:56 UTC (permalink / raw)
  To: Hannes Laimer, pve-devel

I also encountered hanging tasks while running e2e tests, often leading
to tests running into timeouts even if the task was already "OK". I
applied these patches to the test VMs and did not encounter problems
with hanging tasks anymore, significantly speeding up the test runs.

Consider this:
Tested-by: Michael Köppl <m.koeppl@proxmox.com>

On Wed Mar 4, 2026 at 2:46 PM CET, Hannes Laimer wrote:
> Thanks a lot @Fabian and @Fiona for helping me debug this!
>
> The problem is that some libaries do overwrite the SIGCHLD handler
> temporarily, if the library is called fast enough this can lead to lost
> CHLD signals which in turn prevents `worker_reaper` from being called in
> RESTEnvironment. So tasks won't get cleaned-up until a different SIGCHLD
> arrives at the same `pvedeamon` process triggering `worker_reaper`.
>
> As @Fabian mentioned in [1] a general re-work of the task handling,
> potentially with `pidfd`s, would make a lot of sense.
>
> These two patches address the problem in the task handling structure as
> it currently is. They
>  - run the PAM lib call in a fork, so signal handler changes the library
>    does are isloated from our process
>  - run `worker_reaper` periodically (5s) do catch any other potential
>    instances of this, since it would be possible that the same happens
>    with other libs, not just PAM
>
> [1] https://lore.proxmox.com/pve-devel/1772617908.i4bmsyq0kp.astroid@yuna.none/T/#m7b0f3873be5755f330e288cfa50905744f225b2b
>
>
> pve-common:
>
> Hannes Laimer (1):
>   RESTEnvironment: periodically reap workers as SIGCHLD fallback
>
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
>
> pve-access-control:
>
> Hannes Laimer (1):
>   pam: fork for PAM authentication to isolate SIGCHLD handler
>
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
>
>
> Summary over all repositories:
>   2 files changed, 51 insertions(+), 32 deletions(-)





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler
  2026-03-04 13:46 ` [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler Hannes Laimer
@ 2026-03-12  9:44   ` Fabian Grünbichler
  0 siblings, 0 replies; 7+ messages in thread
From: Fabian Grünbichler @ 2026-03-12  9:44 UTC (permalink / raw)
  To: Hannes Laimer, pve-devel

On March 4, 2026 2:46 pm, Hannes Laimer wrote:
> PAM modules can temporarily override $SIG{CHLD}, causing SIGCHLDs from
> RESTEnvironment worker processes to be lost. Run the PAM interaction in
> a subprocess via PVE::Tools::run_fork to contain any signal handler
> manipulation to the child.
> 
> Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Tested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

this patch alone fixes the reproducer for me, tracing the signals also
clearly shows that the PAM unix_chkpwd children are now decoupled
properly from our worker handling..

> ---
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
> 
> diff --git a/src/PVE/Auth/PAM.pm b/src/PVE/Auth/PAM.pm
> index 3aacfc0..8586da5 100755
> --- a/src/PVE/Auth/PAM.pm
> +++ b/src/PVE/Auth/PAM.pm
> @@ -27,45 +27,55 @@ sub authenticate_user {
>      # user (www-data) need to be able to read /etc/passwd /etc/shadow
>      die "no password\n" if !$password;
>  
> -    my $pamh = Authen::PAM->new(
> -        'proxmox-ve-auth',
> -        $username,
> -        sub {
> -            my @res;
> -            while (@_) {
> -                my $msg_type = shift;
> -                my $msg = shift;
> -                push @res, (0, $password);
> -            }
> -            push @res, 0;
> -            return @res;
> -        },
> -    );
> -
> -    if (!ref($pamh)) {
> -        my $err = $pamh->pam_strerror($pamh);
> -        die "error during PAM init: $err";
> +    # PAM modules may temporarily override $SIG{CHLD}, causing SIGCHLDs from
> +    # RESTEnvironment workers to be lost. Running the PAM interaction in a fork
> +    # isolates any such handler manipulation from the parent process.
> +    my $client_ip;
> +    if (my $rpcenv = PVE::RPCEnvironment::get()) {
> +        $client_ip = $rpcenv->get_client_ip();
>      }
>  
> -    if (my $rpcenv = PVE::RPCEnvironment::get()) {
> -        if (my $ip = $rpcenv->get_client_ip()) {
> -            $pamh->pam_set_item(PAM_RHOST(), $ip);
> +    PVE::Tools::run_fork(sub {
> +        my $pamh = Authen::PAM->new(
> +            'proxmox-ve-auth',
> +            $username,
> +            sub {
> +                my @res;
> +                while (@_) {
> +                    my $msg_type = shift;
> +                    my $msg = shift;
> +                    push @res, (0, $password);
> +                }
> +                push @res, 0;
> +                return @res;
> +            },
> +        );
> +
> +        if (!ref($pamh)) {
> +            my $err = $pamh->pam_strerror($pamh);
> +            die "error during PAM init: $err";
>          }
> -    }
>  
> -    my $res;
> +        if ($client_ip) {
> +            $pamh->pam_set_item(PAM_RHOST(), $client_ip);
> +        }
>  
> -    if (($res = $pamh->pam_authenticate(0)) != PAM_SUCCESS) {
> -        my $err = $pamh->pam_strerror($res);
> -        die "$err\n";
> -    }
> +        my $res;
>  
> -    if (($res = $pamh->pam_acct_mgmt(0)) != PAM_SUCCESS) {
> -        my $err = $pamh->pam_strerror($res);
> -        die "$err\n";
> -    }
> +        if (($res = $pamh->pam_authenticate(0)) != PAM_SUCCESS) {
> +            my $err = $pamh->pam_strerror($res);
> +            die "$err\n";
> +        }
> +
> +        if (($res = $pamh->pam_acct_mgmt(0)) != PAM_SUCCESS) {
> +            my $err = $pamh->pam_strerror($res);
> +            die "$err\n";
> +        }
> +
> +        $pamh = 0; # call destructor
>  
> -    $pamh = 0; # call destructor
> +        return 1;
> +    });
>  
>      return 1;
>  }
> -- 
> 2.47.3
> 
> 
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback
  2026-03-04 13:46 ` [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback Hannes Laimer
@ 2026-03-12  9:48   ` Fabian Grünbichler
  0 siblings, 0 replies; 7+ messages in thread
From: Fabian Grünbichler @ 2026-03-12  9:48 UTC (permalink / raw)
  To: Hannes Laimer, pve-devel; +Cc: Wolfgang Bumiller

On March 4, 2026 2:46 pm, Hannes Laimer wrote:
> Libraries may temporarily override $SIG{CHLD}, causing worker exit
> signals to be lost. Poll worker_reaper every 5 seconds via an AnyEvent
> timer to catch any missed signals in API server contexts where this can
> be problematic.
> 
> Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
> ---
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> index 81d7e29..cb44823 100644
> --- a/src/PVE/RESTEnvironment.pm
> +++ b/src/PVE/RESTEnvironment.pm
> @@ -37,6 +37,7 @@ my $rest_env;
>  
>  my $WORKER_PIDS;
>  my $WORKER_FLAG = 0;
> +my $worker_reaper_timer;
>  
>  my $log_task_result = sub {
>      my ($upid, $user, $status) = @_;
> @@ -124,6 +125,14 @@ sub init {
>          }
>      };
>  
> +    # Periodically reap workers as a fallback in case a library call temporarily overrides
> +    # $SIG{CHLD} and causes us to miss a signal. Only useful when an event loop is running.

I'd replace the "Only useful" with something like "Only runs", makes it more explicit.

AFAICT, this not only affects SIGCHLD being redirected, but also SIGCHLD
triggering while we are processing a previous SIGCHLD (if we checked
that particular PID already while iterating, but before it exited).

we could maybe consider to change worker_reaper to do

my $dead;
do {
  $dead = waitpid(-1, WNOHANG))
  if (my $info = $WORKER_PIDS->{$dead}) {
    # log and delete
  }
} while $dead > 0;

that way, it should be faster/cheaper and handle other (transitive,
non-worker) zombies that might have been missed by whatever forked them?

we do have quite a few places where we explicitly wait for a specific
child though, that need to be checked because they might break as a
result..

so in the end the best way to make this better is maybe to use pidfds
via AnyEvent if we have an event loop, to get notified and handle
workers exiting without relying on SIGCHLD at all..

for CLI usage, we already waitpid the forked worker directly anyway (so
should not have a zombie problem, and that flow could switch to a pidfd
as well).. there's two fork_worker calls that set the background flag,
but I don't even see how they'd be invoked in a CLI environment..

> +    $worker_reaper_timer = AnyEvent->timer(
> +        after => 5,
> +        interval => 5,
> +        cb => sub { $worker_reaper->() },

worker_reaper isn't that cheap atm, I don't think we want to run it
every 5s.. on a busy server with lots of tasks, this will do a syscall
per registered task worker..

since this is "just" the fallback for the rare case of "somehow we
missed our signal", doing it once every minute or so should be enough?

I think we can apply this with the timer interval set to a more sensible
value, and the comments slightly adapted..

> +    );
> +
>      # environment types
>      # cli  ... command started fron command line
>      # pub  ... access from public server (pveproxy)
> -- 
> 2.47.3
> 
> 
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-03-12  9:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-04 13:46 [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Hannes Laimer
2026-03-04 13:46 ` [PATCH pve-common 1/1] RESTEnvironment: periodically reap workers as SIGCHLD fallback Hannes Laimer
2026-03-12  9:48   ` Fabian Grünbichler
2026-03-04 13:46 ` [PATCH pve-access-control 1/1] pam: fork for PAM authentication to isolate SIGCHLD handler Hannes Laimer
2026-03-12  9:44   ` Fabian Grünbichler
2026-03-06 17:16 ` [PATCH access-control/common 0/2] address probblem with SIGCHLD handler being temporarily overwritten Stefan Hanreich
2026-03-11 14:56 ` Michael Köppl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal