all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pdm-devel] [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks
@ 2025-11-28 14:05 Lukas Wagner
  2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck Lukas Wagner
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Lukas Wagner @ 2025-11-28 14:05 UTC (permalink / raw)
  To: pdm-devel

Lukas Wagner (2):
  remote tasks: poll foreign, non-tracked active tasks to avoid them
    getting stuck
  remote tasks: make sure to update the task cache if there were errors
    when polling

 .../tasks/remote_tasks.rs                     | 76 +++++++++++++++++--
 1 file changed, 68 insertions(+), 8 deletions(-)

-- 
2.47.3



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [pdm-devel] [PATCH datacenter-manager 1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck
  2025-11-28 14:05 [pdm-devel] [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Lukas Wagner
@ 2025-11-28 14:05 ` Lukas Wagner
  2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 2/2] remote tasks: make sure to update the task cache if there were errors when polling Lukas Wagner
  2025-11-30  1:17 ` [pdm-devel] applied: [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Lukas Wagner @ 2025-11-28 14:05 UTC (permalink / raw)
  To: pdm-devel

If tasks which are currently active but not tracked (not started by PDM)
are added to the task cache, then under some special circumstances, they
can get stuck in the 'active' state. This happened mostly due to a bug
that was already fixed [1]. As a safeguard and to fix existing stuck
tasks, we now poll active tasks with a long interval (10min) and check
if they are finished.

This polling is done as part of the regular poll loop and does not
result in additional API calls if there are no active, non-tracked
tasks.

[1] fixed in 6247ff3c7

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../tasks/remote_tasks.rs                     | 70 +++++++++++++++++--
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs b/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
index d3c8395e..c8d50183 100644
--- a/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
+++ b/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
@@ -20,7 +20,7 @@ use server::{
     pbs_client,
     remote_tasks::{
         self,
-        task_cache::{NodeFetchSuccessMap, State, TaskCache, TaskCacheItem},
+        task_cache::{GetTasks, NodeFetchSuccessMap, State, TaskCache, TaskCacheItem},
         KEEP_OLD_FILES, REMOTE_TASKS_DIR, ROTATE_AFTER,
     },
     task_utils,
@@ -34,6 +34,14 @@ const POLL_INTERVAL: Duration = Duration::from_secs(10);
 /// task for this remote).
 const TASK_FETCH_INTERVAL: Duration = Duration::from_secs(600);
 
+/// Interval in seconds at which we poll active tasks. This only really affects 'foreign' (as in,
+/// not started by PDM) tasks. Tasks which were started by PDM are always 'tracked' and therefore
+/// polled at the interval set in [`POLL_INTERVAL`].
+// NOTE: Since we at the moment never query active tasks from remotes, this is merely a safeguard
+// to clear stuck active tasks from a previous bug. If we at some point query active tasks, we
+// might lower this interval.
+const POLL_ACTIVE_INTERVAL: Duration = Duration::from_secs(600);
+
 /// Interval at which to check for task cache rotation.
 const CHECK_ROTATE_INTERVAL: Duration = Duration::from_secs(3600);
 
@@ -63,6 +71,9 @@ struct TaskState {
     last_fetch: Instant,
     /// Time at which we last applied the journal.
     last_journal_apply: Instant,
+    /// Time at which we polled active tasks. This is done to ensure that
+    /// active tasks are never stuck in the 'active' state
+    last_active_poll: Instant,
 }
 
 impl TaskState {
@@ -73,6 +84,7 @@ impl TaskState {
             last_rotate_check: now - CHECK_ROTATE_INTERVAL,
             last_fetch: now - TASK_FETCH_INTERVAL,
             last_journal_apply: now - APPLY_JOURNAL_INTERVAL,
+            last_active_poll: now - POLL_ACTIVE_INTERVAL,
         }
     }
 
@@ -91,6 +103,11 @@ impl TaskState {
         self.last_journal_apply = Instant::now();
     }
 
+    /// Reset the journal apply timestamp.
+    fn reset_active_poll(&mut self) {
+        self.last_active_poll = Instant::now();
+    }
+
     /// Should we check for archive rotation?
     fn is_due_for_rotate_check(&self) -> bool {
         Instant::now().duration_since(self.last_rotate_check) > CHECK_ROTATE_INTERVAL
@@ -105,6 +122,11 @@ impl TaskState {
     fn is_due_for_journal_apply(&self) -> bool {
         Instant::now().duration_since(self.last_journal_apply) > APPLY_JOURNAL_INTERVAL
     }
+
+    /// Should we poll active tasks?
+    fn is_due_for_active_poll(&self) -> bool {
+        Instant::now().duration_since(self.last_active_poll) > POLL_ACTIVE_INTERVAL
+    }
 }
 
 /// Start the remote task fetching task
@@ -171,12 +193,32 @@ async fn do_tick(task_state: &mut TaskState) -> Result<(), Error> {
     let total_connections_semaphore = Arc::new(Semaphore::new(MAX_CONNECTIONS));
 
     let cache_state = cache.read_state();
-    let poll_results = poll_tracked_tasks(
-        &remote_config,
-        cache_state.tracked_tasks(),
-        Arc::clone(&total_connections_semaphore),
-    )
-    .await?;
+
+    let poll_results = if task_state.is_due_for_active_poll() {
+        let mut tasks_to_poll: HashSet<RemoteUpid> =
+            HashSet::from_iter(cache_state.tracked_tasks().cloned());
+
+        let active_tasks = get_active_tasks(cache.clone()).await?;
+        tasks_to_poll.extend(active_tasks.into_iter());
+
+        let poll_results = poll_tracked_tasks(
+            &remote_config,
+            tasks_to_poll.iter(),
+            Arc::clone(&total_connections_semaphore),
+        )
+        .await?;
+
+        task_state.reset_active_poll();
+
+        poll_results
+    } else {
+        poll_tracked_tasks(
+            &remote_config,
+            cache_state.tracked_tasks(),
+            Arc::clone(&total_connections_semaphore),
+        )
+        .await?
+    };
 
     // Get a list of remotes that we should poll in this cycle.
     let remotes = if task_state.is_due_for_fetch() {
@@ -357,6 +399,20 @@ async fn apply_journal(cache: TaskCache) -> Result<(), Error> {
     tokio::task::spawn_blocking(move || cache.write()?.apply_journal()).await?
 }
 
+/// Get a list of active tasks.
+async fn get_active_tasks(cache: TaskCache) -> Result<Vec<RemoteUpid>, Error> {
+    Ok(tokio::task::spawn_blocking(move || {
+        let tasks: Vec<RemoteUpid> = cache
+            .read()?
+            .get_tasks(GetTasks::Active)?
+            .map(|t| t.upid)
+            .collect();
+
+        Ok::<Vec<RemoteUpid>, Error>(tasks)
+    })
+    .await??)
+}
+
 #[derive(PartialEq, Debug)]
 /// Outcome from polling a tracked task.
 enum PollResult {
-- 
2.47.3



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [pdm-devel] [PATCH datacenter-manager 2/2] remote tasks: make sure to update the task cache if there were errors when polling
  2025-11-28 14:05 [pdm-devel] [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Lukas Wagner
  2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck Lukas Wagner
@ 2025-11-28 14:05 ` Lukas Wagner
  2025-11-30  1:17 ` [pdm-devel] applied: [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Lukas Wagner @ 2025-11-28 14:05 UTC (permalink / raw)
  To: pdm-devel

This is important since the cache implementation will drop any failed
polled tasks from the active file and from the tracked task list.

One instance where the request can fail is when polling a tasks that has
already been rotated out from the task archive on the remote side.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs b/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
index c8d50183..c71a0894 100644
--- a/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
+++ b/server/src/bin/proxmox-datacenter-api/tasks/remote_tasks.rs
@@ -230,7 +230,11 @@ async fn do_tick(task_state: &mut TaskState) -> Result<(), Error> {
 
     let (all_tasks, update_state_for_remote) = fetch_remotes(remotes, Arc::new(cache_state)).await;
 
-    if !all_tasks.is_empty() {
+    if !all_tasks.is_empty()
+        || poll_results
+            .iter()
+            .any(|(_, result)| matches!(result, PollResult::RemoteGone | PollResult::RequestError))
+    {
         update_task_cache(cache, all_tasks, update_state_for_remote, poll_results).await?;
     }
 
-- 
2.47.3



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [pdm-devel] applied: [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks
  2025-11-28 14:05 [pdm-devel] [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Lukas Wagner
  2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck Lukas Wagner
  2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 2/2] remote tasks: make sure to update the task cache if there were errors when polling Lukas Wagner
@ 2025-11-30  1:17 ` Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2025-11-30  1:17 UTC (permalink / raw)
  To: pdm-devel, Lukas Wagner

On Fri, 28 Nov 2025 15:05:20 +0100, Lukas Wagner wrote:
> Lukas Wagner (2):
>   remote tasks: poll foreign, non-tracked active tasks to avoid them
>     getting stuck
>   remote tasks: make sure to update the task cache if there were errors
>     when polling
> 
> .../tasks/remote_tasks.rs                     | 76 +++++++++++++++++--
>  1 file changed, 68 insertions(+), 8 deletions(-)
> 
> [...]

Applied, thanks!

[1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck
      commit: 85db447c35b2581eae1874f40b737f12f8dfc98e
[2/2] remote tasks: make sure to update the task cache if there were errors when polling
      commit: 6163111507ea7ec6bc9cf0e9b256a1f62ffbdfce


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-11-30  1:16 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-28 14:05 [pdm-devel] [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Lukas Wagner
2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 1/2] remote tasks: poll foreign, non-tracked active tasks to avoid them getting stuck Lukas Wagner
2025-11-28 14:05 ` [pdm-devel] [PATCH datacenter-manager 2/2] remote tasks: make sure to update the task cache if there were errors when polling Lukas Wagner
2025-11-30  1:17 ` [pdm-devel] applied: [PATCH datacenter-manager 0/2] remote tasks: avoid stuck running tasks Thomas Lamprecht

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal