From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Hannes Laimer <h.laimer@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [PATCH pve-common] RESTEnvironment: fix possible race in `register_worker`
Date: Wed, 04 Mar 2026 10:57:23 +0100
Message-ID: <1772617908.i4bmsyq0kp.astroid@yuna.none>
In-Reply-To: <88a31fde-949b-4e3c-9561-f68c6cedc28b@proxmox.com>
On March 3, 2026 9:37 am, Hannes Laimer wrote:
> On 2026-03-03 09:24, Fabian Grünbichler wrote:
>> On March 3, 2026 8:15 am, Hannes Laimer wrote:
>>> If the worker finishes right after we `waitpid` but before we add it to
>>> `WORKER_PIDS` the `worker_reaper` won't `waitpid` it cause it iterates
>>> over `WORKER_PIDS`. So
>>
>> it would be interesting to get more details how this happens in practice
>> (with your reproducer)?
>>
>
> I do have a reproducer for task processes sticking around as zombies
> when they are done, but this change unfortunately did not fix that. I
> just noticed this in the process of finding the cause for the "original"
> problem, so I guess this is not a problem in practice, cause of the
> tight timings? But technically it would be possible (I think)

so we figured that one out in the meantime.. and it is probably best to
fix that issue by revamping the worker tracking entirely, both to fix
the bug and to improve performance/reduce overhead.

I still think we want to improve the error handling during forking, and
that this patch here doesn't actually fix anything substantial other
than temporary zombies if the worker terminates during setup.
it shouldn't hurt either though..

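
to illustrate what a "temporary zombie" could look like: a reaper that only
waits on registered PIDs skips a child that exited before it was registered,
and only collects it on a later pass. this is a small Python sketch, not the
pve-common code; `reap_registered`, `demo` and the dict registry are made-up
stand-ins for `worker_reaper` and `WORKER_PIDS`.

```python
import os

def reap_registered(registry):
    """Non-blocking reap of every PID in `registry`; returns the reaped PIDs."""
    reaped = []
    for pid in list(registry):
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            del registry[pid]
            reaped.append(pid)
    return reaped

def demo():
    registry = {}  # hypothetical stand-in for WORKER_PIDS
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child terminates right away
    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so it is a zombie at this point.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    # A reaper pass before registration does not see the PID at all.
    missed = reap_registered(registry)
    # Registration happens too late for that pass ...
    registry[pid] = True
    # ... so the zombie stays around until the reaper runs again.
    later = reap_registered(registry)
    return pid, missed, later
```

in this sketch the first pass returns nothing and the second one collects the
child, so the zombie only lives until the next reaper run.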
>> the sequence when forking a worker is:
>> - fork
>> - child executes some setup code
>> - child tells parent it is ready
>> - child waits for parent to tell it it can continue
>>
>> register_worker is called by the parent in between the last two steps
>> (after receiving the notification from the child, but before sending the
>> notification to the child), so why does the child disappear in between?
>>
>> I think this might actually (also?) be missing error handling in
>> fork_worker? all the POSIX::close/read/write calls there don't check for
>> failure, which means we attempt to register a worker that has already
>> failed at that point?
>>
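
for illustration, the handshake quoted above with the missing checks added
could look roughly like this. it is a Python sketch rather than the actual
Perl in fork_worker; the signature, `child_setup` and `registry` are made up
for the example. in Python, os.read/os.write raise on errors, so the only
explicit check needed is for EOF on the "ready" pipe.

```python
import os

def fork_worker(child_setup, registry):
    """Fork a child, wait for its 'ready' byte, register it, then release it."""
    ready_r, ready_w = os.pipe()  # child -> parent: "I am ready"
    go_r, go_w = os.pipe()        # parent -> child: "you may continue"
    pid = os.fork()
    if pid == 0:
        # Child: run setup, report readiness, wait for the go-ahead.
        code = 1
        try:
            os.close(ready_r)
            os.close(go_w)
            child_setup()                 # may raise
            os.write(ready_w, b"1")
            if os.read(go_r, 1) == b"1":  # EOF means the parent went away
                code = 0
        finally:
            os._exit(code)                # never return into the parent's code
    # Parent
    os.close(ready_w)
    os.close(go_r)
    try:
        if os.read(ready_r, 1) != b"1":
            # EOF: the child died during setup. Reap it here and do not
            # register a worker that has already failed.
            os.waitpid(pid, 0)
            raise RuntimeError(f"worker {pid} failed before becoming ready")
        registry[pid] = True              # registration step
        os.write(go_w, b"1")              # let the child continue
    finally:
        os.close(ready_r)
        os.close(go_w)
    return pid
```

with this shape a child that dies during setup is reaped on the spot and never
reaches the registry, instead of being registered after it has already failed.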
>
> could be, but I don't think that should influence if a `SIGCHLD` is sent
> when the child is done? Cause the handler for `SIGCHLD` in the parent is
> never called...
> I'll take a look at that, thanks for the pointer!
>
>> and, somewhat tangentially related - should we switch this code over to
>> use pidfds and waitid to close PID reuse races?
>>
>
> @Wolfgang also mentioned that, would probably make sense
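
as a rough idea of what the pidfd variant means: the parent holds a file
descriptor that refers to one specific process and waits on that, instead of
on a numeric PID that could be recycled once the process has been reaped
elsewhere. a minimal Python sketch, assuming Python >= 3.9 and Linux >= 5.4
(`wait_via_pidfd` is a made-up helper, and the Perl side would need its own
bindings for these syscalls):

```python
import os

def wait_via_pidfd(pid):
    """Reap `pid` through a pidfd and return its exit status.

    Assumes the child exits normally; for a child killed by a signal,
    si_status holds the signal number instead.
    """
    pidfd = os.pidfd_open(pid)  # the fd now pins this exact process
    try:
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
    finally:
        os.close(pidfd)
    return info.si_status
```

opening the pidfd right after fork is safe, since an unreaped child's PID
cannot be reused. the fd could also be handed to poll/epoll to get notified
about the exit.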