* [pbs-devel] [RFC proxmox 0/2] worker task setup improvements @ 2024-11-29 13:13 Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly Fabian Grünbichler ` (3 more replies) 0 siblings, 4 replies; 10+ messages in thread From: Fabian Grünbichler @ 2024-11-29 13:13 UTC (permalink / raw) To: pbs-devel two of the more prominent issues, RFC for now since we want more testing and more follow-ups before applying. Fabian Grünbichler (2): rest-server: handle failure in worker task setup correctly rest-server: close race window when updating worker task count proxmox-rest-server/src/worker_task.rs | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly 2024-11-29 13:13 [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Fabian Grünbichler @ 2024-11-29 13:13 ` Fabian Grünbichler 2024-11-29 13:34 ` Thomas Lamprecht 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count Fabian Grünbichler ` (2 subsequent siblings) 3 siblings, 1 reply; 10+ messages in thread From: Fabian Grünbichler @ 2024-11-29 13:13 UTC (permalink / raw) To: pbs-devel if setting up a new worker fails after it has been inserted into the WORKER_TASK_LIST, we need to clean it up instead of bubbling up the error right away, else we "leak" the worker task and it never finishes.. Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> --- we probably want to optimize update_active_workers as well to reduce the lock contention there that triggers this issue in the first place.. proxmox-rest-server/src/worker_task.rs | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs index 6e76c2ca..3ca93965 100644 --- a/proxmox-rest-server/src/worker_task.rs +++ b/proxmox-rest-server/src/worker_task.rs @@ -923,7 +923,12 @@ impl WorkerTask { set_worker_count(hash.len()); } - setup.update_active_workers(Some(&upid))?; + let res = setup.update_active_workers(Some(&upid)); + if res.is_err() { + // needed to undo the insertion into WORKER_TASK_LIST above + worker.log_result(&res); + res? + } Ok((worker, logger)) } -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly Fabian Grünbichler @ 2024-11-29 13:34 ` Thomas Lamprecht 2024-12-02 9:14 ` Fabian Grünbichler 0 siblings, 1 reply; 10+ messages in thread From: Thomas Lamprecht @ 2024-11-29 13:34 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Fabian Grünbichler Am 29.11.24 um 14:13 schrieb Fabian Grünbichler: > if setting up a new worker fails after it has been inserted into the > WORKER_TASK_LIST, we need to clean it up instead of bubbling up the error right > away, else we "leak" the worker task and it never finishes.. > > Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> > --- > we probably want to optimize update_active_workers as well to reduce the lock > contention there that triggers this issue in the first place.. > > proxmox-rest-server/src/worker_task.rs | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs > index 6e76c2ca..3ca93965 100644 > --- a/proxmox-rest-server/src/worker_task.rs > +++ b/proxmox-rest-server/src/worker_task.rs > @@ -923,7 +923,12 @@ impl WorkerTask { > set_worker_count(hash.len()); > } > > - setup.update_active_workers(Some(&upid))?; > + let res = setup.update_active_workers(Some(&upid)); > + if res.is_err() { > + // needed to undo the insertion into WORKER_TASK_LIST above > + worker.log_result(&res); > + res? > + } Seems OK from a quick look, need a bit more time for a proper review. What the quick look can give though is style nits, i.e. IMO a bit unidiomatic for our code. Would prefer one of: Combined return path through matching match setup.update_active_workers(Some(&upid)) { Err(err) => { // needed to undo the insertion into the active WORKER_TASK_LIST above worker.log_result(&res); Err(err) } Ok(_) => Ok((worker, logger)) } or similar than yours but avoid the outer variable: if let Err(err) = setup.update_active_workers(Some(&upid)) { // needed to undo the insertion into the active WORKER_TASK_LIST above worker.log_result(&res); return Err(err); } IMO both fit slightly (!) better for how errors are commonly dealt with in rust and are thus a bit easier to understand correctly on reading. > > Ok((worker, logger)) > } _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly 2024-11-29 13:34 ` Thomas Lamprecht @ 2024-12-02 9:14 ` Fabian Grünbichler 0 siblings, 0 replies; 10+ messages in thread From: Fabian Grünbichler @ 2024-12-02 9:14 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Thomas Lamprecht On November 29, 2024 2:34 pm, Thomas Lamprecht wrote: > Am 29.11.24 um 14:13 schrieb Fabian Grünbichler: >> if setting up a new worker fails after it has been inserted into the >> WORKER_TASK_LIST, we need to clean it up instead of bubbling up the error right >> away, else we "leak" the worker task and it never finishes.. >> >> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> >> --- >> we probably want to optimize update_active_workers as well to reduce the lock >> contention there that triggers this issue in the first place.. >> >> proxmox-rest-server/src/worker_task.rs | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs >> index 6e76c2ca..3ca93965 100644 >> --- a/proxmox-rest-server/src/worker_task.rs >> +++ b/proxmox-rest-server/src/worker_task.rs >> @@ -923,7 +923,12 @@ impl WorkerTask { >> set_worker_count(hash.len()); >> } >> >> - setup.update_active_workers(Some(&upid))?; >> + let res = setup.update_active_workers(Some(&upid)); >> + if res.is_err() { >> + // needed to undo the insertion into WORKER_TASK_LIST above >> + worker.log_result(&res); >> + res? >> + } > > Seems OK from a quick look, need a bit more time for a proper review. > > What the quick look can give though is style nits, i.e. IMO a bit unidiomatic for our > code. > > Would prefer one of: > > Combined return path through matching > > match setup.update_active_workers(Some(&upid)) { > Err(err) => { > // needed to undo the insertion into the active WORKER_TASK_LIST above > worker.log_result(&res); > Err(err) > } > Ok(_) => Ok((worker, logger)) > } > > or similar than yours but avoid the outer variable: > > if let Err(err) = setup.update_active_workers(Some(&upid)) { > // needed to undo the insertion into the active WORKER_TASK_LIST above > worker.log_result(&res); > return Err(err); > } > > IMO both fit slightly (!) better for how errors are commonly dealt with in rust and > are thus a bit easier to understand correctly on reading. neither of those work though, since both the log_result and the return value need the Err(err), and err is not Clone.. maybe there is a way to make it work, I didn't find one quickly last week and want to hand over something to work with to Dominik ;) maybe I am missing some easy way out though.. > >> >> Ok((worker, logger)) >> } > > _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count 2024-11-29 13:13 [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly Fabian Grünbichler @ 2024-11-29 13:13 ` Fabian Grünbichler 2024-11-29 13:27 ` Thomas Lamprecht 2024-11-29 14:53 ` [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Dominik Csapak 2024-12-02 13:04 ` [pbs-devel] superseded: " Fabian Grünbichler 3 siblings, 1 reply; 10+ messages in thread From: Fabian Grünbichler @ 2024-11-29 13:13 UTC (permalink / raw) To: pbs-devel this mimics how the count is updated when spawning a new task - the lock scope needs to cover the count update itself, else there's a race when multiple worker's log their result at the same time.. Co-developed-by: Dominik Csapak <d.csapak@proxmox.com> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> --- proxmox-rest-server/src/worker_task.rs | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs index 3ca93965..018d18c0 100644 --- a/proxmox-rest-server/src/worker_task.rs +++ b/proxmox-rest-server/src/worker_task.rs @@ -1023,7 +1023,8 @@ impl WorkerTask { WORKER_TASK_LIST.lock().unwrap().remove(&self.upid.task_id); let _ = self.setup.update_active_workers(None); - set_worker_count(WORKER_TASK_LIST.lock().unwrap().len()); + let lock = WORKER_TASK_LIST.lock().unwrap(); + set_worker_count(lock.len()); } /// Log a message. -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count Fabian Grünbichler @ 2024-11-29 13:27 ` Thomas Lamprecht 2024-11-29 14:20 ` Dominik Csapak 0 siblings, 1 reply; 10+ messages in thread From: Thomas Lamprecht @ 2024-11-29 13:27 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Fabian Grünbichler Am 29.11.24 um 14:13 schrieb Fabian Grünbichler: > this mimics how the count is updated when spawning a new task - the lock scope > needs to cover the count update itself, else there's a race when multiple > worker's log their result at the same time.. > > Co-developed-by: Dominik Csapak <d.csapak@proxmox.com> > Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> > --- > proxmox-rest-server/src/worker_task.rs | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs > index 3ca93965..018d18c0 100644 > --- a/proxmox-rest-server/src/worker_task.rs > +++ b/proxmox-rest-server/src/worker_task.rs > @@ -1023,7 +1023,8 @@ impl WorkerTask { > > WORKER_TASK_LIST.lock().unwrap().remove(&self.upid.task_id); > let _ = self.setup.update_active_workers(None); > - set_worker_count(WORKER_TASK_LIST.lock().unwrap().len()); > + let lock = WORKER_TASK_LIST.lock().unwrap(); why not use this also for the remove operation above? I.e. something like: let locked_worker_tasks = WORKER_TASK_LIST.lock().unwrap(); locked_worker_tasks.remove(&self.upid.task_id); set_worker_count(locked_worker_tasks.len()) If there are technical reason speaking against this, which I hope not, then a comment would be definitively warranted, otherwise using a single lock would IMO make this a bit clearer and locking twice isn't exactly cheaper. Looks OK besides that, but would still want to take a closer look. > + set_worker_count(lock.len()); > } > > /// Log a message. _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count 2024-11-29 13:27 ` Thomas Lamprecht @ 2024-11-29 14:20 ` Dominik Csapak 2024-12-02 9:14 ` Fabian Grünbichler 0 siblings, 1 reply; 10+ messages in thread From: Dominik Csapak @ 2024-11-29 14:20 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Thomas Lamprecht, Fabian Grünbichler On 11/29/24 14:27, Thomas Lamprecht wrote: > Am 29.11.24 um 14:13 schrieb Fabian Grünbichler: >> this mimics how the count is updated when spawning a new task - the lock scope >> needs to cover the count update itself, else there's a race when multiple >> worker's log their result at the same time.. >> >> Co-developed-by: Dominik Csapak <d.csapak@proxmox.com> >> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> >> --- >> proxmox-rest-server/src/worker_task.rs | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs >> index 3ca93965..018d18c0 100644 >> --- a/proxmox-rest-server/src/worker_task.rs >> +++ b/proxmox-rest-server/src/worker_task.rs >> @@ -1023,7 +1023,8 @@ impl WorkerTask { >> >> WORKER_TASK_LIST.lock().unwrap().remove(&self.upid.task_id); >> let _ = self.setup.update_active_workers(None); >> - set_worker_count(WORKER_TASK_LIST.lock().unwrap().len()); >> + let lock = WORKER_TASK_LIST.lock().unwrap(); > > why not use this also for the remove operation above? I.e. something like: > > let locked_worker_tasks = WORKER_TASK_LIST.lock().unwrap(); > > locked_worker_tasks.remove(&self.upid.task_id); > > set_worker_count(locked_worker_tasks.len()) > > If there are technical reason speaking against this, which I hope not, then a > comment would be definitively warranted, otherwise using a single lock would > IMO make this a bit clearer and locking twice isn't exactly cheaper. here the reason of the split lock is that the 'self.setup.update_active_workers` internally can take a lock to the WORKER_TASK_LIST, so we can't hold one over that call not super sure if can reorder these, so that we reduce the count before updating though. From what i understand though we want to remove ourselves from the list of actives tasks before reducing that counter. as fabian indicated in the other patch, we should probably split up the 'update_active_workers' into seperate methods to * add one worker * remove one worker * housekeeping for leftover workers then we could design the removal in a way that does not rely on the WORKER_TASK_LIST in the first place thus we could remove it from the active list before removing it from the internal hashmap (and could take a lock around both, the list and the count) > > Looks OK besides that, but would still want to take a closer look. > >> + set_worker_count(lock.len()); >> } >> >> /// Log a message. > > > > _______________________________________________ > pbs-devel mailing list > pbs-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count 2024-11-29 14:20 ` Dominik Csapak @ 2024-12-02 9:14 ` Fabian Grünbichler 0 siblings, 0 replies; 10+ messages in thread From: Fabian Grünbichler @ 2024-12-02 9:14 UTC (permalink / raw) To: Dominik Csapak, Proxmox Backup Server development discussion, Thomas Lamprecht On November 29, 2024 3:20 pm, Dominik Csapak wrote: > On 11/29/24 14:27, Thomas Lamprecht wrote: >> Am 29.11.24 um 14:13 schrieb Fabian Grünbichler: >>> this mimics how the count is updated when spawning a new task - the lock scope >>> needs to cover the count update itself, else there's a race when multiple >>> worker's log their result at the same time.. >>> >>> Co-developed-by: Dominik Csapak <d.csapak@proxmox.com> >>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> >>> --- >>> proxmox-rest-server/src/worker_task.rs | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/proxmox-rest-server/src/worker_task.rs b/proxmox-rest-server/src/worker_task.rs >>> index 3ca93965..018d18c0 100644 >>> --- a/proxmox-rest-server/src/worker_task.rs >>> +++ b/proxmox-rest-server/src/worker_task.rs >>> @@ -1023,7 +1023,8 @@ impl WorkerTask { >>> >>> WORKER_TASK_LIST.lock().unwrap().remove(&self.upid.task_id); >>> let _ = self.setup.update_active_workers(None); >>> - set_worker_count(WORKER_TASK_LIST.lock().unwrap().len()); >>> + let lock = WORKER_TASK_LIST.lock().unwrap(); >> >> why not use this also for the remove operation above? I.e. something like: >> >> let locked_worker_tasks = WORKER_TASK_LIST.lock().unwrap(); >> >> locked_worker_tasks.remove(&self.upid.task_id); >> >> set_worker_count(locked_worker_tasks.len()) >> >> If there are technical reason speaking against this, which I hope not, then a >> comment would be definitively warranted, otherwise using a single lock would >> IMO make this a bit clearer and locking twice isn't exactly cheaper. > > here the reason of the split lock is that the 'self.setup.update_active_workers` internally > can take a lock to the WORKER_TASK_LIST, so we can't hold one over that call > > not super sure if can reorder these, so that we reduce the count before updating > though. From what i understand though we want to remove ourselves from the list > of actives tasks before reducing that counter. > > as fabian indicated in the other patch, we should probably split up > the 'update_active_workers' into seperate methods to > * add one worker > * remove one worker > * housekeeping for leftover workers > > then we could design the removal in a way that does not rely on the WORKER_TASK_LIST > in the first place thus we could remove it from the active list before removing it > from the internal hashmap (and could take a lock around both, the list and the count) yes to all of the above. and a comment why the lock is obtained twice probably is a good ideal for the stop-gap fix. > >> >> Looks OK besides that, but would still want to take a closer look. >> >>> + set_worker_count(lock.len()); >>> } >>> >>> /// Log a message. >> >> >> >> _______________________________________________ >> pbs-devel mailing list >> pbs-devel@lists.proxmox.com >> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel > > > _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [pbs-devel] [RFC proxmox 0/2] worker task setup improvements 2024-11-29 13:13 [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count Fabian Grünbichler @ 2024-11-29 14:53 ` Dominik Csapak 2024-12-02 13:04 ` [pbs-devel] superseded: " Fabian Grünbichler 3 siblings, 0 replies; 10+ messages in thread From: Dominik Csapak @ 2024-11-29 14:53 UTC (permalink / raw) To: Fabian Grünbichler, pbs-devel regardless of the style comments thomas already wrote, just wanted to report that it fixed my local testcase: * start very many restore tasks in parallel, until they run into an error with task state update (did with proxmox-backup-client restore > /dev/null) * reload the daemon * stop all the tasks (e.g. by killing the backup-client on the client side) see that there are no more running tasks in the gui, but still old running proxmox-backup-proxy daemons This is fixed with the first patch of the series _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
* [pbs-devel] superseded: [RFC proxmox 0/2] worker task setup improvements 2024-11-29 13:13 [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Fabian Grünbichler ` (2 preceding siblings ...) 2024-11-29 14:53 ` [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Dominik Csapak @ 2024-12-02 13:04 ` Fabian Grünbichler 3 siblings, 0 replies; 10+ messages in thread From: Fabian Grünbichler @ 2024-12-02 13:04 UTC (permalink / raw) To: Proxmox Backup Server development discussion sent a v2: https://lore.proxmox.com/pbs-devel/20241202130412.1290616-1-f.gruenbichler@proxmox.com On November 29, 2024 2:13 pm, Fabian Grünbichler wrote: > two of the more prominent issues, RFC for now since we want more testing and > more follow-ups before applying. > > Fabian Grünbichler (2): > rest-server: handle failure in worker task setup correctly > rest-server: close race window when updating worker task count > > proxmox-rest-server/src/worker_task.rs | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > -- > 2.39.5 > > > > _______________________________________________ > pbs-devel mailing list > pbs-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel > _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-12-02 13:04 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-11-29 13:13 [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 1/2] rest-server: handle failure in worker task setup correctly Fabian Grünbichler 2024-11-29 13:34 ` Thomas Lamprecht 2024-12-02 9:14 ` Fabian Grünbichler 2024-11-29 13:13 ` [pbs-devel] [PATCH proxmox 2/2] rest-server: close race window when updating worker task count Fabian Grünbichler 2024-11-29 13:27 ` Thomas Lamprecht 2024-11-29 14:20 ` Dominik Csapak 2024-12-02 9:14 ` Fabian Grünbichler 2024-11-29 14:53 ` [pbs-devel] [RFC proxmox 0/2] worker task setup improvements Dominik Csapak 2024-12-02 13:04 ` [pbs-devel] superseded: " Fabian Grünbichler
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox