* [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements
@ 2024-12-02 13:04 Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly Fabian Grünbichler
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw)
To: pbs-devel
This series fixes two issues related to reset server shutdown/reload and
worker task accounting.
issue 1 (patches 1, rfcs 3, 4):
if WorkerTask::new returned an error because updating the task indices
failed (for example, because the lock covering such updates was not
acquired because of a timeout), the task was registered in the worker
task list, but not returned to the caller, which means the task could
never actually execute and reach its cleanup/log_result stage, which
would unregister it again. effectively, in such a scenario the worker
task is "leaked", but the task count decrement never happens, which in
turn means the corresponding proxy can never shutdown, since it will
wait for the phantom task to finish forever.
this issue was actually found in the wild on a system with lots of
activity.
issue 2 (patch 2):
a lock scope issue could cause a temporary inconsistency between the
task list and task count, if multiple tasks log their result in
parallel. the discrepancy disappars with the next task that is created
or logs its result, since the count is always reset to the current count
and not incremented/decremented.
this issue was found while analyzing the code.
Fabian Grünbichler (4):
rest-server: handle failure in worker task setup correctly
rest-server: close race window when updating worker task count
rest-server: make worker task creation error handling more idiomatic
rest-server: increase task index lock timeout to 15s
proxmox-rest-server/src/worker_task.rs | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly
2024-12-02 13:04 [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements Fabian Grünbichler
@ 2024-12-02 13:04 ` Fabian Grünbichler
2024-12-02 15:57 ` [pbs-devel] applied: " Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count Fabian Grünbichler
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw)
To: pbs-devel
if setting up a new worker fails after it has been inserted into the
WORKER_TASK_LIST, we need to clean it up instead of bubbling up the error right
away, else we "leak" the worker task and it never finishes..
a worker task that never finishes will indefinitely block shutdown
of the rest server process, including the "old" process when reloading
the rest server.
this issue was found in the wild on a system with lock contention on the
file-based lock covering task index updating leading to lock acquiring
timeouts.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
proxmox-rest-server/src/worker_task.rs | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs
index 6e76c2ca..3ca93965 100644
--- a/proxmox-rest-server/src/worker_task.rs
+++ b/proxmox-rest-server/src/worker_task.rs
@@ -923,7 +923,12 @@ impl WorkerTask {
set_worker_count(hash.len());
}
- setup.update_active_workers(Some(&upid))?;
+ let res = setup.update_active_workers(Some(&upid));
+ if res.is_err() {
+ // needed to undo the insertion into WORKER_TASK_LIST above
+ worker.log_result(&res);
+ res?
+ }
Ok((worker, logger))
}
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count
2024-12-02 13:04 [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly Fabian Grünbichler
@ 2024-12-02 13:04 ` Fabian Grünbichler
2024-12-02 15:57 ` [pbs-devel] applied: " Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s Fabian Grünbichler
3 siblings, 1 reply; 9+ messages in thread
From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw)
To: pbs-devel
this mimics how the count is updated when spawning a new task - the lock scope
needs to cover the count update itself, else there's a race when multiple
worker's log their result at the same time..
Co-developed-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
v2: added more comments for current and future readers :)
proxmox-rest-server/src/worker_task.rs | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs
index 3ca93965..bbf82ff9 100644
--- a/proxmox-rest-server/src/worker_task.rs
+++ b/proxmox-rest-server/src/worker_task.rs
@@ -923,6 +923,7 @@ impl WorkerTask {
set_worker_count(hash.len());
}
+ // this wants to access WORKER_TASK_LIST, so we need to drop the lock above
let res = setup.update_active_workers(Some(&upid));
if res.is_err() {
// needed to undo the insertion into WORKER_TASK_LIST above
@@ -1022,8 +1023,11 @@ impl WorkerTask {
self.log_message(state.result_text());
WORKER_TASK_LIST.lock().unwrap().remove(&self.upid.task_id);
+ // this wants to access WORKER_TASK_LIST, so we need to drop the lock above
let _ = self.setup.update_active_workers(None);
- set_worker_count(WORKER_TASK_LIST.lock().unwrap().len());
+ // re-acquire the lock and hold it while updating the count
+ let lock = WORKER_TASK_LIST.lock().unwrap();
+ set_worker_count(lock.len());
}
/// Log a message.
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic
2024-12-02 13:04 [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count Fabian Grünbichler
@ 2024-12-02 13:04 ` Fabian Grünbichler
2024-12-02 15:58 ` Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s Fabian Grünbichler
3 siblings, 1 reply; 9+ messages in thread
From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw)
To: pbs-devel
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
best I could come up with ;) feel free to fold that one in
proxmox-rest-server/src/worker_task.rs | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs
index bbf82ff9..74828855 100644
--- a/proxmox-rest-server/src/worker_task.rs
+++ b/proxmox-rest-server/src/worker_task.rs
@@ -928,10 +928,10 @@ impl WorkerTask {
if res.is_err() {
// needed to undo the insertion into WORKER_TASK_LIST above
worker.log_result(&res);
- res?
+ Err(res.unwrap_err())
+ } else {
+ Ok((worker, logger))
}
-
- Ok((worker, logger))
}
/// Spawn a new tokio task/future.
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s
2024-12-02 13:04 [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements Fabian Grünbichler
` (2 preceding siblings ...)
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic Fabian Grünbichler
@ 2024-12-02 13:04 ` Fabian Grünbichler
2024-12-02 15:59 ` [pbs-devel] applied: " Thomas Lamprecht
3 siblings, 1 reply; 9+ messages in thread
From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw)
To: pbs-devel
this lock can be quite contended, until the surrounding code is properly split
to reduce this contention it should help to give the worker task
creation/cleanup code a bit more breathing room.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
proxmox-rest-server/src/worker_task.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs
index 74828855..3806159c 100644
--- a/proxmox-rest-server/src/worker_task.rs
+++ b/proxmox-rest-server/src/worker_task.rs
@@ -139,7 +139,7 @@ impl WorkerTaskSetup {
.clone()
.perm(nix::sys::stat::Mode::from_bits_truncate(0o660));
- let timeout = std::time::Duration::new(10, 0);
+ let timeout = std::time::Duration::new(15, 0);
let file =
proxmox_sys::fs::open_file_locked(&self.task_lock_fn, timeout, exclusive, options)?;
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] applied: [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly Fabian Grünbichler
@ 2024-12-02 15:57 ` Thomas Lamprecht
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Lamprecht @ 2024-12-02 15:57 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Fabian Grünbichler
Am 02.12.24 um 14:04 schrieb Fabian Grünbichler:
> if setting up a new worker fails after it has been inserted into the
> WORKER_TASK_LIST, we need to clean it up instead of bubbling up the error right
> away, else we "leak" the worker task and it never finishes..
>
> a worker task that never finishes will indefinitely block shutdown
> of the rest server process, including the "old" process when reloading
> the rest server.
>
> this issue was found in the wild on a system with lock contention on the
> file-based lock covering task index updating leading to lock acquiring
> timeouts.
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> proxmox-rest-server/src/worker_task.rs | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
>
applied, thanks!
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] applied: [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count Fabian Grünbichler
@ 2024-12-02 15:57 ` Thomas Lamprecht
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Lamprecht @ 2024-12-02 15:57 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Fabian Grünbichler
Am 02.12.24 um 14:04 schrieb Fabian Grünbichler:
> this mimics how the count is updated when spawning a new task - the lock scope
> needs to cover the count update itself, else there's a race when multiple
> worker's log their result at the same time..
>
> Co-developed-by: Dominik Csapak <d.csapak@proxmox.com>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> v2: added more comments for current and future readers :)
>
> proxmox-rest-server/src/worker_task.rs | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>
applied, thanks!
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic Fabian Grünbichler
@ 2024-12-02 15:58 ` Thomas Lamprecht
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Lamprecht @ 2024-12-02 15:58 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Fabian Grünbichler
Am 02.12.24 um 14:04 schrieb Fabian Grünbichler:
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>
> Notes:
> best I could come up with ;) feel free to fold that one in
>
I skipped this for now, it's IMO about the same (i.e., still feels a bit more
odder than it should be, but really not worth the headache)
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] applied: [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s Fabian Grünbichler
@ 2024-12-02 15:59 ` Thomas Lamprecht
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Lamprecht @ 2024-12-02 15:59 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Fabian Grünbichler
Am 02.12.24 um 14:04 schrieb Fabian Grünbichler:
> this lock can be quite contended, until the surrounding code is properly split
> to reduce this contention it should help to give the worker task
> creation/cleanup code a bit more breathing room.
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> proxmox-rest-server/src/worker_task.rs | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>
applied, thanks!
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-12-02 15:59 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-02 13:04 [pbs-devel] [PATCH proxmox v2 0/4] worker task setup improvements Fabian Grünbichler
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 1/4] rest-server: handle failure in worker task setup correctly Fabian Grünbichler
2024-12-02 15:57 ` [pbs-devel] applied: " Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [PATCH proxmox v2 2/4] rest-server: close race window when updating worker task count Fabian Grünbichler
2024-12-02 15:57 ` [pbs-devel] applied: " Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 3/4] rest-server: make worker task creation error handling more idiomatic Fabian Grünbichler
2024-12-02 15:58 ` Thomas Lamprecht
2024-12-02 13:04 ` [pbs-devel] [RFC proxmox v2 4/4] rest-server: increase task index lock timeout to 15s Fabian Grünbichler
2024-12-02 15:59 ` [pbs-devel] applied: " Thomas Lamprecht
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox