public inbox for pdm-devel@lists.proxmox.com
* [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
@ 2025-01-28 12:25 Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/15] pdm-api-types: derive Debug and PartialEq for TaskListItem Lukas Wagner
                   ` (15 more replies)
  0 siblings, 16 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This patch series changes the remote task caching behavior from a purely
time-based cache to a FIFO cache-replacement policy with a maximum number
of cached tasks per remote. If the maximum number is exceeded, the oldest
tasks are dropped from the cache.
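
The eviction itself boils down to the following (a minimal sketch;
apply_fifo_limit is a hypothetical name, the series implements this as
part of a merging iterator):

    use pdm_api_types::TaskListItem;

    /// Keep at most `limit` tasks. Sorting newest-first means that
    /// truncating drops the oldest entries, i.e. FIFO eviction.
    fn apply_fifo_limit(
        mut tasks: Vec<TaskListItem>,
        limit: usize,
    ) -> Vec<TaskListItem> {
        tasks.sort_by_key(|task| -task.starttime);
        tasks.truncate(limit);
        tasks
    }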

When the remote-tasks API is called, any task data that is not yet in the
cache is requested from the remotes. At the moment this is limited to once
every 5 minutes, with the option of a forced refresh (to be triggered by a
refresh button in the UI). As before, we augment the cached task data with
the currently running tracked tasks which were started by PDM.
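
In pseudocode, the throttling amounts to something like the following
(hypothetical names; a sketch of the intended behavior, not the actual
code from this series):

    const REFRESH_INTERVAL_S: i64 = 300; // 5 minutes

    /// Decide whether we may contact the remote again.
    fn needs_refresh(last_fetch_time: i64, now: i64, force: bool) -> bool {
        force || now - last_fetch_time >= REFRESH_INTERVAL_S
    }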

Some words about the cache storage implementation:
Note that the storage backend for this cache probably needs some more love
in the future. Right now it's just a single JSON file for everything,
mainly because this was the quickest approach to implement to unblock UI
development work.
The problem with this approach is that it does not perform well for setups
with a large number of remotes, since every update rewrites the entire
cache file when the cache is persisted, causing additional CPU and I/O
load.

In the future, we should use a mechanism similar to the task archive in
PBS. I'm not sure if the exact same mechanism can be used due to some
differing requirements, but the general direction probably fits quite
well. If we can reuse it 1:1, we have to break it out of (IIRC) the
WorkerTask struct to make it reusable.
It will probably require some experimentation and benchmarking to find an
ideal approach.
We probably don't want a separate archive per remote, since we do not want
to read hundreds or thousands of files when an aggregated remote task
history is requested via the API. But keeping the complete archive in a
single file also seems quite challenging: we need to keep the data sorted,
while also handling task data that arrives out of order from different
remotes. Keeping the data sorted when new data arrives leads us to the
same problem as with the JSON file, namely that we have to rewrite the
file over and over again, causing load and writes.

The good news is that since this is just a cache, we are pretty free to change
the storage implementation without too much trouble; we don't even have to
migrate the old data, since it should not be an issue to simply request the
data from the remotes again. This is the main reason why I've opted
to keep the JSON file for now; I or somebody else can revisit this at a later
time.

proxmox-datacenter-manager:

Lukas Wagner (15):
  pdm-api-types: derive Debug and PartialEq for TaskListItem
  test support: add NamedTempFile helper
  task cache: add basic test for TaskCache
  task cache: remove max-age mechanism
  task cache: add FIFO cache replacement policy
  remote tasks: move to dir based module
  task cache: move to its own submodule
  task cache: fetch every 5mins, requesting only missing tasks
  remote tasks: return tasks in stable order
  remote tasks: allow to force-fetch latest tasks
  fake remote: add missing fields to make the debug feature compile
    again
  fake remote: generate fake task data
  task cache: tests: improve test coverage
  remote tasks: fix unused variable warning
  remote-tasks: restrict function visibility

 lib/pdm-api-types/src/lib.rs                  |   2 +-
 server/src/api/pve/mod.rs                     |   4 +-
 server/src/api/remote_tasks.rs                |  18 +-
 server/src/lib.rs                             |   4 +-
 .../{task_cache.rs => remote_tasks/mod.rs}    | 253 ++++--------
 server/src/remote_tasks/task_cache.rs         | 365 ++++++++++++++++++
 server/src/test_support/fake_remote.rs        |  81 +++-
 server/src/test_support/mod.rs                |   4 +
 server/src/test_support/temp.rs               |  33 ++
 9 files changed, 563 insertions(+), 201 deletions(-)
 rename server/src/{task_cache.rs => remote_tasks/mod.rs} (62%)
 create mode 100644 server/src/remote_tasks/task_cache.rs
 create mode 100644 server/src/test_support/temp.rs


Summary over all repositories:
  9 files changed, 563 insertions(+), 201 deletions(-)

-- 
Generated by git-murpp 0.8.0


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 01/15] pdm-api-types: derive Debug and PartialEq for TaskListItem
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/15] test support: add NamedTempFile helper Lukas Wagner
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

These are needed for the task cache test case that will follow in one of
the next commits.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-api-types/src/lib.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
index 3844907..5ac2a69 100644
--- a/lib/pdm-api-types/src/lib.rs
+++ b/lib/pdm-api-types/src/lib.rs
@@ -237,7 +237,7 @@ pub enum TaskStateType {
         upid: { schema: UPID::API_SCHEMA },
     },
 )]
-#[derive(Clone, Serialize, Deserialize)]
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
 /// Task properties.
 pub struct TaskListItem {
     pub upid: String,
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 02/15] test support: add NamedTempFile helper
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/15] pdm-api-types: derive Debug and PartialEq for TaskListItem Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/15] task cache: add basic test for TaskCache Lukas Wagner
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This one is useful when writing tests; it automatically removes the
temporary file when dropped. The name was chosen because of the similar
NamedTempFile struct in the popular tempfile crate.
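
For illustration, intended usage in a test looks roughly like this (the
CreateOptions setup mirrors the TaskCache test added later in this
series):

    let options = CreateOptions::new()
        .owner(nix::unistd::Uid::effective())
        .group(nix::unistd::Gid::effective());

    let file = NamedTempFile::new(options)?;
    std::fs::write(file.path(), b"some test data")?;
    // the file is removed automatically once `file` is dropped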

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/lib.rs               |  2 +-
 server/src/test_support/mod.rs  |  4 ++++
 server/src/test_support/temp.rs | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100644 server/src/test_support/temp.rs

diff --git a/server/src/lib.rs b/server/src/lib.rs
index 12dc912..143ee32 100644
--- a/server/src/lib.rs
+++ b/server/src/lib.rs
@@ -13,7 +13,7 @@ pub mod task_utils;
 pub mod connection;
 pub mod pbs_client;
 
-#[cfg(remote_config = "faked")]
+#[cfg(any(remote_config = "faked", test))]
 pub mod test_support;
 
 use anyhow::Error;
diff --git a/server/src/test_support/mod.rs b/server/src/test_support/mod.rs
index e54cd72..f026011 100644
--- a/server/src/test_support/mod.rs
+++ b/server/src/test_support/mod.rs
@@ -1 +1,5 @@
+#[cfg(remote_config = "faked")]
 pub mod fake_remote;
+
+#[cfg(test)]
+pub mod temp;
diff --git a/server/src/test_support/temp.rs b/server/src/test_support/temp.rs
new file mode 100644
index 0000000..a3a6d59
--- /dev/null
+++ b/server/src/test_support/temp.rs
@@ -0,0 +1,33 @@
+use std::path::{Path, PathBuf};
+
+use anyhow::Error;
+
+use proxmox_sys::fs::CreateOptions;
+
+/// Temporary file that is cleaned up when dropped.
+pub struct NamedTempFile {
+    path: PathBuf,
+}
+
+impl NamedTempFile {
+    /// Create a new temporary file.
+    ///
+    /// The file will be created with the passed [`CreateOptions`].
+    pub fn new(options: CreateOptions) -> Result<Self, Error> {
+        let base = std::env::temp_dir().join("test");
+        let (_, path) = proxmox_sys::fs::make_tmp_file(base, options)?;
+
+        Ok(Self { path })
+    }
+
+    /// Return the [`Path`] to the temporary file.
+    pub fn path(&self) -> &Path {
+        &self.path
+    }
+}
+
+impl Drop for NamedTempFile {
+    fn drop(&mut self) {
+        let _ = std::fs::remove_file(&self.path);
+    }
+}
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 03/15] task cache: add basic test for TaskCache
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/15] pdm-api-types: derive Debug and PartialEq for TaskListItem Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/15] test support: add NamedTempFile helper Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism Lukas Wagner
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This helps us verify that we do not break the core functionality in the
upcoming changes.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/task_cache.rs | 100 +++++++++++++++++++++++++++++++--------
 1 file changed, 79 insertions(+), 21 deletions(-)

diff --git a/server/src/task_cache.rs b/server/src/task_cache.rs
index 7ded540..211beb4 100644
--- a/server/src/task_cache.rs
+++ b/server/src/task_cache.rs
@@ -25,8 +25,13 @@ pub async fn get_tasks(max_age: i64, filters: TaskFilters) -> Result<Vec<TaskLis
 
     let mut all_tasks = Vec::new();
 
+    let api_uid = pdm_config::api_user()?.uid;
+    let api_gid = pdm_config::api_group()?.gid;
+
+    let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
     let cache_path = Path::new(pdm_buildcfg::PDM_CACHE_DIR).join("taskcache.json");
-    let mut cache = TaskCache::new(cache_path)?;
+    let mut cache = TaskCache::new(cache_path, file_options)?;
 
     // Force a refresh for all tasks of a remote if a task is finished.
     // Not super nice, but saves us from persisting finished tasks. Also,
@@ -246,27 +251,30 @@ struct TaskCache {
 
     /// File-location at which the cached tasks are stored.
     cachefile_path: PathBuf,
+
+    /// File mode/owner/group for the cache file.
+    cachefile_options: CreateOptions,
 }
 
 impl TaskCache {
     /// Create a new tasks cache instance by loading
     /// the cache from disk.
-    fn new(cachefile_path: PathBuf) -> Result<Self, Error> {
+    fn new(cachefile_path: PathBuf, cachefile_options: CreateOptions) -> Result<Self, Error> {
         Ok(Self {
-            content: Self::load_content()?,
+            content: Self::load_content(&cachefile_path)?,
             new_or_updated: Default::default(),
             dirty: false,
             cachefile_path,
+            cachefile_options,
         })
     }
 
     /// Load the task cache contents from disk.
-    fn load_content() -> Result<TaskCacheContent, Error> {
-        let taskcache_path = Path::new(pdm_buildcfg::PDM_CACHE_DIR).join("taskcache.json");
-        let content = proxmox_sys::fs::file_read_optional_string(taskcache_path)?;
+    fn load_content(path: &Path) -> Result<TaskCacheContent, Error> {
+        let content = proxmox_sys::fs::file_read_optional_string(path)?;
 
         let content = if let Some(content) = content {
-            serde_json::from_str(&content)?
+            serde_json::from_str(&content).unwrap_or_default()
         } else {
             Default::default()
         };
@@ -293,7 +301,7 @@ impl TaskCache {
         let _guard = self.lock(Duration::from_secs(5))?;
 
         // Read content again, in case somebody has changed it in the meanwhile
-        let mut content = Self::load_content()?;
+        let mut content = Self::load_content(&self.cachefile_path)?;
 
         for (remote_name, entry) in self.new_or_updated.remote_tasks.drain() {
             if let Some(existing_entry) = content.remote_tasks.get_mut(&remote_name) {
@@ -308,12 +316,12 @@ impl TaskCache {
 
         let bytes = serde_json::to_vec_pretty(&content)?;
 
-        let api_uid = pdm_config::api_user()?.uid;
-        let api_gid = pdm_config::api_group()?.gid;
-
-        let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
-        proxmox_sys::fs::replace_file(&self.cachefile_path, &bytes, file_options, true)?;
+        proxmox_sys::fs::replace_file(
+            &self.cachefile_path,
+            &bytes,
+            self.cachefile_options.clone(),
+            true,
+        )?;
 
         self.dirty = false;
 
@@ -358,22 +366,23 @@ impl TaskCache {
     // without a lock, since the cache file is replaced atomically
     // when updating.
     fn lock(&self, duration: Duration) -> Result<File, Error> {
-        let api_uid = pdm_config::api_user()?.uid;
-        let api_gid = pdm_config::api_group()?.gid;
-
-        let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-        proxmox_sys::fs::open_file_locked(self.lockfile_path(), duration, true, file_options)
+        proxmox_sys::fs::open_file_locked(
+            self.lockfile_path(),
+            duration,
+            true,
+            self.cachefile_options.clone(),
+        )
     }
 }
 
-#[derive(Serialize, Deserialize)]
+#[derive(Debug, Serialize, Deserialize)]
 /// Per-remote entry in the task cache.
 struct TaskCacheEntry {
     timestamp: i64,
     tasks: Vec<TaskListItem>,
 }
 
-#[derive(Default, Serialize, Deserialize)]
+#[derive(Debug, Default, Serialize, Deserialize)]
 /// Content of the task cache file.
 struct TaskCacheContent {
     remote_tasks: HashMap<String, TaskCacheEntry>,
@@ -522,3 +531,52 @@ pub fn tasktype(status: &str) -> TaskStateType {
         TaskStateType::Error
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::test_support::temp::NamedTempFile;
+
+    #[test]
+    fn basic_task_cache() -> Result<(), Error> {
+        let options = CreateOptions::new()
+            .owner(nix::unistd::Uid::effective())
+            .group(nix::unistd::Gid::effective())
+            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600));
+
+        let temp_file = NamedTempFile::new(options.clone())?;
+
+        let mut cache = TaskCache::new(temp_file.path().into(), options.clone())?;
+
+        let mut tasks = Vec::new();
+
+        let now = proxmox_time::epoch_i64();
+        for i in (0..20).rev() {
+            let upid: PveUpid =
+                "UPID:pve-node:0000C530:001C9BEC:677E934A:stopall::root@pam:".parse()?;
+
+            tasks.push(TaskListItem {
+                upid: upid.to_string(),
+                node: upid.node,
+                pid: upid.pid as i64,
+                pstart: upid.pstart,
+                starttime: now - 10 * i,
+                worker_type: upid.worker_type,
+                worker_id: upid.worker_id,
+                user: upid.auth_id,
+                endtime: None,
+                status: None,
+            });
+        }
+
+        cache.set_tasks("some-remote", tasks.clone(), now);
+        cache.save()?;
+
+        let cache = TaskCache::new(temp_file.path().into(), options)?;
+
+        let res = cache.get_tasks("some-remote", now, 10000).unwrap();
+        assert_eq!(tasks, res);
+
+        Ok(())
+    }
+}
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (2 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/15] task cache: add basic test for TaskCache Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-29 18:27   ` Thomas Lamprecht
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/15] task cache: add FIFO cache replacement policy Lukas Wagner
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This commit removes the time-based caching policy for remote tasks. It
will be replaced by a cache-replacement policy based on the total number
of tasks per remote in an upcoming commit.

Suggested-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/remote_tasks.rs | 11 ++---------
 server/src/task_cache.rs       | 31 +++++++++----------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/server/src/api/remote_tasks.rs b/server/src/api/remote_tasks.rs
index 57b59fd..da3b718 100644
--- a/server/src/api/remote_tasks.rs
+++ b/server/src/api/remote_tasks.rs
@@ -21,13 +21,6 @@ const SUBDIRS: SubdirMap = &sorted!([("list", &Router::new().get(&API_METHOD_LIS
     },
     input: {
         properties: {
-            "max-age": {
-                type: Integer,
-                optional: true,
-                // TODO: sensible default max-age
-                default: 300,
-                description: "Maximum age of cached task data",
-            },
             filters: {
                 type: TaskFilters,
                 flatten: true,
@@ -36,8 +29,8 @@ const SUBDIRS: SubdirMap = &sorted!([("list", &Router::new().get(&API_METHOD_LIS
     },
 )]
 /// Get the list of tasks for all remotes.
-async fn list_tasks(max_age: i64, filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
-    let tasks = task_cache::get_tasks(max_age, filters).await?;
+async fn list_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
+    let tasks = task_cache::get_tasks(filters).await?;
 
     Ok(tasks)
 }
diff --git a/server/src/task_cache.rs b/server/src/task_cache.rs
index 211beb4..f24af3f 100644
--- a/server/src/task_cache.rs
+++ b/server/src/task_cache.rs
@@ -20,7 +20,7 @@ use crate::{api::pve, task_utils};
 
 /// Get tasks for all remotes
 // FIXME: filter for privileges
-pub async fn get_tasks(max_age: i64, filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
+pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
     let (remotes, _) = pdm_config::remotes::config()?;
 
     let mut all_tasks = Vec::new();
@@ -42,9 +42,7 @@ pub async fn get_tasks(max_age: i64, filters: TaskFilters) -> Result<Vec<TaskLis
     invalidate_cache_for_finished_tasks(&mut cache);
 
     for (remote_name, remote) in &remotes.sections {
-        let now = proxmox_time::epoch_i64();
-
-        if let Some(tasks) = cache.get_tasks(remote_name.as_str(), now, max_age) {
+        if let Some(tasks) = cache.get_tasks(remote_name.as_str()) {
             // Data in cache is recent enough and has not been invalidated.
             all_tasks.extend(tasks);
         } else {
@@ -55,7 +53,7 @@ pub async fn get_tasks(max_age: i64, filters: TaskFilters) -> Result<Vec<TaskLis
                     continue;
                 }
             };
-            cache.set_tasks(remote_name.as_str(), tasks.clone(), now);
+            cache.set_tasks(remote_name.as_str(), tasks.clone());
 
             all_tasks.extend(tasks);
         }
@@ -305,10 +303,7 @@ impl TaskCache {
 
         for (remote_name, entry) in self.new_or_updated.remote_tasks.drain() {
             if let Some(existing_entry) = content.remote_tasks.get_mut(&remote_name) {
-                // Only update entry if nobody else has updated it in the meanwhile
-                if existing_entry.timestamp < entry.timestamp {
-                    *existing_entry = entry;
-                }
+                *existing_entry = entry;
             } else {
                 content.remote_tasks.insert(remote_name, entry);
             }
@@ -329,25 +324,18 @@ impl TaskCache {
     }
 
     // Update task data for a given remote.
-    fn set_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>, timestamp: i64) {
+    fn set_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
         self.dirty = true;
         self.new_or_updated
             .remote_tasks
-            .insert(remote.to_string(), TaskCacheEntry { timestamp, tasks });
+            .insert(remote.to_string(), TaskCacheEntry { tasks });
     }
 
     // Get task data for a given remote.
-    fn get_tasks(&self, remote: &str, now: i64, max_age: i64) -> Option<Vec<TaskListItem>> {
+    fn get_tasks(&self, remote: &str) -> Option<Vec<TaskListItem>> {
         if let Some(entry) = self.content.remote_tasks.get(remote) {
-            if (entry.timestamp + max_age) < now {
-                return None;
-            }
-
             Some(entry.tasks.clone())
         } else if let Some(entry) = self.new_or_updated.remote_tasks.get(remote) {
-            if (entry.timestamp + max_age) < now {
-                return None;
-            }
             Some(entry.tasks.clone())
         } else {
             None
@@ -378,7 +366,6 @@ impl TaskCache {
 #[derive(Debug, Serialize, Deserialize)]
 /// Per-remote entry in the task cache.
 struct TaskCacheEntry {
-    timestamp: i64,
     tasks: Vec<TaskListItem>,
 }
 
@@ -569,12 +556,12 @@ mod tests {
             });
         }
 
-        cache.set_tasks("some-remote", tasks.clone(), now);
+        cache.set_tasks("some-remote", tasks.clone());
         cache.save()?;
 
         let cache = TaskCache::new(temp_file.path().into(), options)?;
 
-        let res = cache.get_tasks("some-remote", now, 10000).unwrap();
+        let res = cache.get_tasks("some-remote").unwrap();
         assert_eq!(tasks, res);
 
         Ok(())
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 05/15] task cache: add FIFO cache replacement policy
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (3 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/15] remote tasks: move to dir based module Lukas Wagner
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

The task cache will now hold up to a certain number of tasks
*per remote*. If the capacity is exceeded, the oldest tasks,
based on the task's starttime, will be dropped.
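
For example, merging two task lists behaves as follows (values taken
from the merge_tasks test added at the end of this patch):

    // `a` and `b` each contain a task with starttime 40; the finished
    // one (endtime set) wins on a tie, and the result is capped at 5
    // entries, newest first.
    let tasks = TaskCache::merge_tasks(a, b, 5);
    assert_eq!(tasks.len(), 5);
    assert_eq!(tasks[0].starttime, 40);
    assert_eq!(tasks[0].endtime, Some(50));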

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/task_cache.rs | 193 +++++++++++++++++++++++++++++++--------
 1 file changed, 153 insertions(+), 40 deletions(-)

diff --git a/server/src/task_cache.rs b/server/src/task_cache.rs
index f24af3f..032f2a4 100644
--- a/server/src/task_cache.rs
+++ b/server/src/task_cache.rs
@@ -1,6 +1,8 @@
 use std::{
+    cmp::Ordering,
     collections::{HashMap, HashSet},
     fs::File,
+    iter::Peekable,
     path::{Path, PathBuf},
     sync::{LazyLock, RwLock},
     time::Duration,
@@ -18,6 +20,9 @@ use tokio::task::JoinHandle;
 
 use crate::{api::pve, task_utils};
 
+// TODO: Does this number make sense?
+const CACHED_TASKS_PER_REMOTE: usize = 2000;
+
 /// Get tasks for all remotes
 // FIXME: filter for privileges
 pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
@@ -31,7 +36,7 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
     let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
 
     let cache_path = Path::new(pdm_buildcfg::PDM_CACHE_DIR).join("taskcache.json");
-    let mut cache = TaskCache::new(cache_path, file_options)?;
+    let mut cache = TaskCache::new(cache_path, file_options, CACHED_TASKS_PER_REMOTE)?;
 
     // Force a refresh for all tasks of a remote if a task is finished.
     // Not super nice, but saves us from persisting finished tasks. Also,
@@ -53,7 +58,7 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
                     continue;
                 }
             };
-            cache.set_tasks(remote_name.as_str(), tasks.clone());
+            cache.add_tasks(remote_name.as_str(), tasks.clone());
 
             all_tasks.extend(tasks);
         }
@@ -240,10 +245,6 @@ struct TaskCache {
     /// Cache entries
     content: TaskCacheContent,
 
-    /// Entries that were added or updated - these will be persistet
-    /// when `save` is called.
-    new_or_updated: TaskCacheContent,
-
     /// Cache entries were changed/removed.
     dirty: bool,
 
@@ -252,18 +253,25 @@ struct TaskCache {
 
     /// File mode/owner/group for the cache file.
     cachefile_options: CreateOptions,
+
+    /// Max. tasks per remote
+    max_tasks_per_remote: usize,
 }
 
 impl TaskCache {
     /// Create a new tasks cache instance by loading
     /// the cache from disk.
-    fn new(cachefile_path: PathBuf, cachefile_options: CreateOptions) -> Result<Self, Error> {
+    fn new(
+        cachefile_path: PathBuf,
+        cachefile_options: CreateOptions,
+        max_tasks_per_remote: usize,
+    ) -> Result<Self, Error> {
         Ok(Self {
             content: Self::load_content(&cachefile_path)?,
-            new_or_updated: Default::default(),
             dirty: false,
             cachefile_path,
             cachefile_options,
+            max_tasks_per_remote,
         })
     }
 
@@ -301,15 +309,14 @@ impl TaskCache {
         // Read content again, in case somebody has changed it in the meanwhile
         let mut content = Self::load_content(&self.cachefile_path)?;
 
-        for (remote_name, entry) in self.new_or_updated.remote_tasks.drain() {
-            if let Some(existing_entry) = content.remote_tasks.get_mut(&remote_name) {
-                *existing_entry = entry;
-            } else {
-                content.remote_tasks.insert(remote_name, entry);
+        for (remote, entry) in self.content.remote_tasks.iter_mut() {
+            if let Some(other) = content.remote_tasks.remove(remote) {
+                entry.tasks =
+                    Self::merge_tasks(entry.tasks.clone(), other.tasks, self.max_tasks_per_remote);
             }
         }
 
-        let bytes = serde_json::to_vec_pretty(&content)?;
+        let bytes = serde_json::to_vec_pretty(&self.content)?;
 
         proxmox_sys::fs::replace_file(
             &self.cachefile_path,
@@ -323,20 +330,22 @@ impl TaskCache {
         Ok(())
     }
 
-    // Update task data for a given remote.
-    fn set_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
+    /// Add tasks to the cache.
+    ///
+    /// If the total number of stored tasks exceeds `max_tasks_per_remote`, the
+    /// oldest ones are truncated.
+    fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
+        let entry = self.content.remote_tasks.entry(remote.into()).or_default();
+
+        entry.tasks = Self::merge_tasks(entry.tasks.clone(), tasks, self.max_tasks_per_remote);
+
         self.dirty = true;
-        self.new_or_updated
-            .remote_tasks
-            .insert(remote.to_string(), TaskCacheEntry { tasks });
     }
 
     // Get task data for a given remote.
     fn get_tasks(&self, remote: &str) -> Option<Vec<TaskListItem>> {
         if let Some(entry) = self.content.remote_tasks.get(remote) {
             Some(entry.tasks.clone())
-        } else if let Some(entry) = self.new_or_updated.remote_tasks.get(remote) {
-            Some(entry.tasks.clone())
         } else {
             None
         }
@@ -361,9 +370,72 @@ impl TaskCache {
             self.cachefile_options.clone(),
         )
     }
+
+    fn merge_tasks(
+        mut a: Vec<TaskListItem>,
+        mut b: Vec<TaskListItem>,
+        limit: usize,
+    ) -> Vec<TaskListItem> {
+        a.sort_by_key(|task| -task.starttime);
+        b.sort_by_key(|task| -task.starttime);
+
+        MergeTaskIterator {
+            left: a.into_iter().peekable(),
+            right: b.into_iter().peekable(),
+        }
+        .take(limit)
+        .collect()
+    }
+}
+
+struct MergeTaskIterator<T: Iterator<Item = TaskListItem>> {
+    left: Peekable<T>,
+    right: Peekable<T>,
+}
+
+impl<T> Iterator for MergeTaskIterator<T>
+where
+    T: Iterator<Item = TaskListItem>,
+{
+    type Item = T::Item;
+
+    fn next(&mut self) -> Option<T::Item> {
+        let order = match (self.left.peek(), self.right.peek()) {
+            (Some(l), Some(r)) => Some(l.starttime.cmp(&r.starttime)),
+            (Some(_), None) => Some(Ordering::Greater),
+            (None, Some(_)) => Some(Ordering::Less),
+            (None, None) => None,
+        };
+
+        match order {
+            Some(Ordering::Greater) => self.left.next(),
+            Some(Ordering::Less) => self.right.next(),
+            Some(Ordering::Equal) => {
+                // Both unwraps are safe, the former peek/match
+                // guarantees that there is an element.
+                let l = self.left.peek().unwrap();
+                let r = self.right.peek().unwrap();
+
+                // Dedup if both lists contain the same task
+                if l.upid == r.upid {
+                    // Take the one where the task is already finished
+                    if l.endtime.is_some() {
+                        let _ = self.right.next();
+                        self.left.next()
+                    } else {
+                        let _ = self.left.next();
+                        self.right.next()
+                    }
+                } else {
+                    self.left.next()
+                }
+            }
+            None => None,
+        }
+    }
 }
 
-#[derive(Debug, Serialize, Deserialize)]
+#[derive(Default, Debug, Serialize, Deserialize)]
 /// Per-remote entry in the task cache.
 struct TaskCacheEntry {
     tasks: Vec<TaskListItem>,
@@ -524,6 +596,29 @@ mod tests {
     use super::*;
     use crate::test_support::temp::NamedTempFile;
 
+    fn make_upid(
+        starttime: i64,
+        endtime: Option<i64>,
+        status: Option<String>,
+    ) -> Result<TaskListItem, Error> {
+        let upid: PveUpid =
+            format!("UPID:pve-node:0000C530:001C9BEC:{starttime:08X}:stopall::root@pam:",)
+                .parse()?;
+
+        Ok(TaskListItem {
+            upid: upid.to_string(),
+            node: upid.node,
+            pid: upid.pid as i64,
+            pstart: upid.pstart,
+            starttime: upid.starttime,
+            worker_type: upid.worker_type,
+            worker_id: upid.worker_id,
+            user: upid.auth_id,
+            endtime,
+            status,
+        })
+    }
+
     #[test]
     fn basic_task_cache() -> Result<(), Error> {
         let options = CreateOptions::new()
@@ -533,37 +628,55 @@ mod tests {
 
         let temp_file = NamedTempFile::new(options.clone())?;
 
-        let mut cache = TaskCache::new(temp_file.path().into(), options.clone())?;
+        let mut cache = TaskCache::new(temp_file.path().into(), options.clone(), 50)?;
 
         let mut tasks = Vec::new();
 
         let now = proxmox_time::epoch_i64();
         for i in (0..20).rev() {
-            let upid: PveUpid =
-                "UPID:pve-node:0000C530:001C9BEC:677E934A:stopall::root@pam:".parse()?;
-
-            tasks.push(TaskListItem {
-                upid: upid.to_string(),
-                node: upid.node,
-                pid: upid.pid as i64,
-                pstart: upid.pstart,
-                starttime: now - 10 * i,
-                worker_type: upid.worker_type,
-                worker_id: upid.worker_id,
-                user: upid.auth_id,
-                endtime: None,
-                status: None,
-            });
+            tasks.push(make_upid(now - 10 * i, None, None)?);
         }
 
-        cache.set_tasks("some-remote", tasks.clone());
+        cache.add_tasks("some-remote", tasks.clone());
         cache.save()?;
 
-        let cache = TaskCache::new(temp_file.path().into(), options)?;
+        let cache = TaskCache::new(temp_file.path().into(), options, 50)?;
 
         let res = cache.get_tasks("some-remote").unwrap();
+        tasks.reverse();
         assert_eq!(tasks, res);
 
         Ok(())
     }
+
+    #[test]
+    fn merge_tasks() -> Result<(), Error> {
+        // Arrange
+        let mut a = Vec::new();
+        for i in [30, 10, 20] {
+            a.push(make_upid(i, None, None)?);
+        }
+
+        let mut b = Vec::new();
+        for i in [25, 15, 35, 5] {
+            b.push(make_upid(i, None, None)?);
+        }
+
+        a.push(make_upid(40, None, None)?);
+        b.push(make_upid(40, Some(50), Some("some status".into()))?);
+
+        // Act
+        let tasks = TaskCache::merge_tasks(a, b, 5);
+
+        // Assert
+        assert_eq!(tasks.len(), 5);
+        assert_eq!(tasks[0].starttime, 40);
+        assert_eq!(tasks[0].endtime, Some(50));
+        assert_eq!(tasks[1].starttime, 35);
+        assert_eq!(tasks[2].starttime, 30);
+        assert_eq!(tasks[3].starttime, 25);
+        assert_eq!(tasks[4].starttime, 20);
+
+        Ok(())
+    }
 }
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 06/15] remote tasks: move to dir based module
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (4 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/15] task cache: add FIFO cache replacement policy Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/15] task cache: move to its own submodule Lukas Wagner
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

The module is becoming quite large at this point; a dir-based module
allows us to split it apart in future commits.

No functional changes.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/pve/mod.rs                         | 4 ++--
 server/src/api/remote_tasks.rs                    | 4 ++--
 server/src/lib.rs                                 | 2 +-
 server/src/{task_cache.rs => remote_tasks/mod.rs} | 0
 4 files changed, 5 insertions(+), 5 deletions(-)
 rename server/src/{task_cache.rs => remote_tasks/mod.rs} (100%)

diff --git a/server/src/api/pve/mod.rs b/server/src/api/pve/mod.rs
index 2cefbb4..c328675 100644
--- a/server/src/api/pve/mod.rs
+++ b/server/src/api/pve/mod.rs
@@ -27,7 +27,7 @@ use pve_api_types::{
 
 use super::resources::{map_pve_lxc, map_pve_node, map_pve_qemu, map_pve_storage};
 
-use crate::{connection, task_cache};
+use crate::{connection, remote_tasks};
 
 mod lxc;
 mod node;
@@ -76,7 +76,7 @@ const STATUS_ROUTER: Router = Router::new().get(&API_METHOD_CLUSTER_STATUS);
 // converts a remote + PveUpid into a RemoteUpid and starts tracking it
 fn new_remote_upid(remote: String, upid: PveUpid) -> Result<RemoteUpid, Error> {
     let remote_upid: RemoteUpid = (remote, upid.to_string()).try_into()?;
-    task_cache::track_running_task(remote_upid.clone());
+    remote_tasks::track_running_task(remote_upid.clone());
     Ok(remote_upid)
 }
 
diff --git a/server/src/api/remote_tasks.rs b/server/src/api/remote_tasks.rs
index da3b718..05ce366 100644
--- a/server/src/api/remote_tasks.rs
+++ b/server/src/api/remote_tasks.rs
@@ -4,7 +4,7 @@ use proxmox_router::{list_subdirs_api_method, Permission, Router, SubdirMap};
 use proxmox_schema::api;
 use proxmox_sortable_macro::sortable;
 
-use crate::task_cache;
+use crate::remote_tasks;
 
 pub const ROUTER: Router = Router::new()
     .get(&list_subdirs_api_method!(SUBDIRS))
@@ -30,7 +30,7 @@ const SUBDIRS: SubdirMap = &sorted!([("list", &Router::new().get(&API_METHOD_LIS
 )]
 /// Get the list of tasks for all remotes.
 async fn list_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
-    let tasks = task_cache::get_tasks(filters).await?;
+    let tasks = remote_tasks::get_tasks(filters).await?;
 
     Ok(tasks)
 }
diff --git a/server/src/lib.rs b/server/src/lib.rs
index 143ee32..4320e46 100644
--- a/server/src/lib.rs
+++ b/server/src/lib.rs
@@ -6,8 +6,8 @@ pub mod auth;
 pub mod context;
 pub mod env;
 pub mod metric_collection;
+pub mod remote_tasks;
 pub mod resource_cache;
-pub mod task_cache;
 pub mod task_utils;
 
 pub mod connection;
diff --git a/server/src/task_cache.rs b/server/src/remote_tasks/mod.rs
similarity index 100%
rename from server/src/task_cache.rs
rename to server/src/remote_tasks/mod.rs
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 07/15] task cache: move to its own submodule
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (5 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/15] remote tasks: move to dir based module Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks Lukas Wagner
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

No functional changes, only adapting method visibility where needed.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/mod.rs        | 309 +------------------------
 server/src/remote_tasks/task_cache.rs | 312 ++++++++++++++++++++++++++
 2 files changed, 317 insertions(+), 304 deletions(-)
 create mode 100644 server/src/remote_tasks/task_cache.rs

diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 032f2a4..4a0552c 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -1,11 +1,7 @@
 use std::{
-    cmp::Ordering,
-    collections::{HashMap, HashSet},
-    fs::File,
-    iter::Peekable,
-    path::{Path, PathBuf},
+    collections::HashSet,
+    path::Path,
     sync::{LazyLock, RwLock},
-    time::Duration,
 };
 
 use anyhow::Error;
@@ -15,11 +11,13 @@ use pdm_api_types::{
 };
 use proxmox_sys::fs::CreateOptions;
 use pve_api_types::{ListTasks, ListTasksResponse, ListTasksSource, PveUpid};
-use serde::{Deserialize, Serialize};
 use tokio::task::JoinHandle;
 
 use crate::{api::pve, task_utils};
 
+mod task_cache;
+use task_cache::TaskCache;
+
 // TODO: Does this number make sense?
 const CACHED_TASKS_PER_REMOTE: usize = 2000;
 
@@ -240,213 +238,6 @@ fn add_running_tasks(cached_tasks: Vec<TaskListItem>) -> Result<Vec<TaskListItem
     Ok(returned_tasks)
 }
 
-/// A cache for fetched remote tasks.
-struct TaskCache {
-    /// Cache entries
-    content: TaskCacheContent,
-
-    /// Cache entries were changed/removed.
-    dirty: bool,
-
-    /// File-location at which the cached tasks are stored.
-    cachefile_path: PathBuf,
-
-    /// File mode/owner/group for the cache file.
-    cachefile_options: CreateOptions,
-
-    /// Max. tasks per remote
-    max_tasks_per_remote: usize,
-}
-
-impl TaskCache {
-    /// Create a new tasks cache instance by loading
-    /// the cache from disk.
-    fn new(
-        cachefile_path: PathBuf,
-        cachefile_options: CreateOptions,
-        max_tasks_per_remote: usize,
-    ) -> Result<Self, Error> {
-        Ok(Self {
-            content: Self::load_content(&cachefile_path)?,
-            dirty: false,
-            cachefile_path,
-            cachefile_options,
-            max_tasks_per_remote,
-        })
-    }
-
-    /// Load the task cache contents from disk.
-    fn load_content(path: &Path) -> Result<TaskCacheContent, Error> {
-        let content = proxmox_sys::fs::file_read_optional_string(path)?;
-
-        let content = if let Some(content) = content {
-            serde_json::from_str(&content).unwrap_or_default()
-        } else {
-            Default::default()
-        };
-
-        Ok(content)
-    }
-
-    /// Get path for the cache's lockfile.
-    fn lockfile_path(&self) -> PathBuf {
-        let mut path = self.cachefile_path.clone();
-        path.set_extension("lock");
-        path
-    }
-
-    /// Persist the task cache
-    ///
-    /// This method requests an exclusive lock for the task cache lockfile.
-    fn save(&mut self) -> Result<(), Error> {
-        // if we have not updated anything, we don't have to update the cache file
-        if !self.dirty {
-            return Ok(());
-        }
-
-        let _guard = self.lock(Duration::from_secs(5))?;
-
-        // Read content again, in case somebody has changed it in the meanwhile
-        let mut content = Self::load_content(&self.cachefile_path)?;
-
-        for (remote, entry) in self.content.remote_tasks.iter_mut() {
-            if let Some(other) = content.remote_tasks.remove(remote) {
-                entry.tasks =
-                    Self::merge_tasks(entry.tasks.clone(), other.tasks, self.max_tasks_per_remote);
-            }
-        }
-
-        let bytes = serde_json::to_vec_pretty(&self.content)?;
-
-        proxmox_sys::fs::replace_file(
-            &self.cachefile_path,
-            &bytes,
-            self.cachefile_options.clone(),
-            true,
-        )?;
-
-        self.dirty = false;
-
-        Ok(())
-    }
-
-    /// Add tasks to the cache.
-    ///
-    /// If the total number of stored tasks exceeds `max_tasks_per_remote`, the
-    /// oldest ones are truncated.
-    fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
-        let entry = self.content.remote_tasks.entry(remote.into()).or_default();
-
-        entry.tasks = Self::merge_tasks(entry.tasks.clone(), tasks, self.max_tasks_per_remote);
-
-        self.dirty = true;
-    }
-
-    // Get task data for a given remote.
-    fn get_tasks(&self, remote: &str) -> Option<Vec<TaskListItem>> {
-        if let Some(entry) = self.content.remote_tasks.get(remote) {
-            Some(entry.tasks.clone())
-        } else {
-            None
-        }
-    }
-
-    // Invalidate cache for a given remote.
-    fn invalidate_cache_for_remote(&mut self, remote: &str) {
-        self.dirty = true;
-        self.content.remote_tasks.remove(remote);
-    }
-
-    // Lock the cache for modification.
-    //
-    // While the cache is locked, other users can still read the cache
-    // without a lock, since the cache file is replaced atomically
-    // when updating.
-    fn lock(&self, duration: Duration) -> Result<File, Error> {
-        proxmox_sys::fs::open_file_locked(
-            self.lockfile_path(),
-            duration,
-            true,
-            self.cachefile_options.clone(),
-        )
-    }
-
-    fn merge_tasks(
-        mut a: Vec<TaskListItem>,
-        mut b: Vec<TaskListItem>,
-        limit: usize,
-    ) -> Vec<TaskListItem> {
-        a.sort_by_key(|task| -task.starttime);
-        b.sort_by_key(|task| -task.starttime);
-
-        MergeTaskIterator {
-            left: a.into_iter().peekable(),
-            right: b.into_iter().peekable(),
-        }
-        .take(limit)
-        .collect()
-    }
-}
-
-struct MergeTaskIterator<T: Iterator<Item = TaskListItem>> {
-    left: Peekable<T>,
-    right: Peekable<T>,
-}
-
-impl<T> Iterator for MergeTaskIterator<T>
-where
-    T: Iterator<Item = TaskListItem>,
-{
-    type Item = T::Item;
-
-    fn next(&mut self) -> Option<T::Item> {
-        let order = match (self.left.peek(), self.right.peek()) {
-            (Some(l), Some(r)) => Some(l.starttime.cmp(&r.starttime)),
-            (Some(_), None) => Some(Ordering::Greater),
-            (None, Some(_)) => Some(Ordering::Less),
-            (None, None) => None,
-        };
-
-        match order {
-            Some(Ordering::Greater) => self.left.next(),
-            Some(Ordering::Less) => self.right.next(),
-            Some(Ordering::Equal) => {
-                // Both unwraps are safe, the former peek/match
-                // guarantees that there is an element.
-                let l = self.left.peek().unwrap();
-                let r = self.right.peek().unwrap();
-
-                // Dedup if both lists contain the same task
-                if l.upid == r.upid {
-                    // Take the one where the task is already finished
-                    if l.endtime.is_some() {
-                        let _ = self.right.next();
-                        self.left.next()
-                    } else {
-                        let _ = self.left.next();
-                        self.right.next()
-                    }
-                } else {
-                    self.left.next()
-                }
-            }
-            None => None,
-        }
-    }
-}
-
-#[derive(Default, Debug, Serialize, Deserialize)]
-/// Per-remote entry in the task cache.
-struct TaskCacheEntry {
-    tasks: Vec<TaskListItem>,
-}
-
-#[derive(Debug, Default, Serialize, Deserialize)]
-/// Content of the task cache file.
-struct TaskCacheContent {
-    remote_tasks: HashMap<String, TaskCacheEntry>,
-}
-
 /// Interval at which tracked tasks are polled
 const RUNNING_CHECK_INTERVAL_S: u64 = 10;
 
@@ -590,93 +381,3 @@ pub fn tasktype(status: &str) -> TaskStateType {
         TaskStateType::Error
     }
 }
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-    use crate::test_support::temp::NamedTempFile;
-
-    fn make_upid(
-        starttime: i64,
-        endtime: Option<i64>,
-        status: Option<String>,
-    ) -> Result<TaskListItem, Error> {
-        let upid: PveUpid =
-            format!("UPID:pve-node:0000C530:001C9BEC:{starttime:08X}:stopall::root@pam:",)
-                .parse()?;
-
-        Ok(TaskListItem {
-            upid: upid.to_string(),
-            node: upid.node,
-            pid: upid.pid as i64,
-            pstart: upid.pstart,
-            starttime: upid.starttime,
-            worker_type: upid.worker_type,
-            worker_id: upid.worker_id,
-            user: upid.auth_id,
-            endtime,
-            status,
-        })
-    }
-
-    #[test]
-    fn basic_task_cache() -> Result<(), Error> {
-        let options = CreateOptions::new()
-            .owner(nix::unistd::Uid::effective())
-            .group(nix::unistd::Gid::effective())
-            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600));
-
-        let temp_file = NamedTempFile::new(options.clone())?;
-
-        let mut cache = TaskCache::new(temp_file.path().into(), options.clone(), 50)?;
-
-        let mut tasks = Vec::new();
-
-        let now = proxmox_time::epoch_i64();
-        for i in (0..20).rev() {
-            tasks.push(make_upid(now - 10 * i, None, None)?);
-        }
-
-        cache.add_tasks("some-remote", tasks.clone());
-        cache.save()?;
-
-        let cache = TaskCache::new(temp_file.path().into(), options, 50)?;
-
-        let res = cache.get_tasks("some-remote").unwrap();
-        tasks.reverse();
-        assert_eq!(tasks, res);
-
-        Ok(())
-    }
-
-    #[test]
-    fn merge_tasks() -> Result<(), Error> {
-        // Arrange
-        let mut a = Vec::new();
-        for i in [30, 10, 20] {
-            a.push(make_upid(i, None, None)?);
-        }
-
-        let mut b = Vec::new();
-        for i in [25, 15, 35, 5] {
-            b.push(make_upid(i, None, None)?);
-        }
-
-        a.push(make_upid(40, None, None)?);
-        b.push(make_upid(40, Some(50), Some("some status".into()))?);
-
-        // Act
-        let tasks = TaskCache::merge_tasks(a, b, 5);
-
-        // Assert
-        assert_eq!(tasks.len(), 5);
-        assert_eq!(tasks[0].starttime, 40);
-        assert_eq!(tasks[0].endtime, Some(50));
-        assert_eq!(tasks[1].starttime, 35);
-        assert_eq!(tasks[2].starttime, 30);
-        assert_eq!(tasks[3].starttime, 25);
-        assert_eq!(tasks[4].starttime, 20);
-
-        Ok(())
-    }
-}
diff --git a/server/src/remote_tasks/task_cache.rs b/server/src/remote_tasks/task_cache.rs
new file mode 100644
index 0000000..8a98876
--- /dev/null
+++ b/server/src/remote_tasks/task_cache.rs
@@ -0,0 +1,312 @@
+use std::{
+    cmp::Ordering,
+    collections::HashMap,
+    fs::File,
+    iter::Peekable,
+    path::{Path, PathBuf},
+    time::Duration,
+};
+
+use anyhow::Error;
+use serde::{Deserialize, Serialize};
+
+use pdm_api_types::TaskListItem;
+use proxmox_sys::fs::CreateOptions;
+
+/// A cache for fetched remote tasks.
+pub(super) struct TaskCache {
+    /// Cache entries
+    content: TaskCacheContent,
+
+    /// Cache entries were changed/removed.
+    dirty: bool,
+
+    /// File-location at which the cached tasks are stored.
+    cachefile_path: PathBuf,
+
+    /// File mode/owner/group for the cache file.
+    cachefile_options: CreateOptions,
+
+    /// Max. tasks per remote
+    max_tasks_per_remote: usize,
+}
+
+impl TaskCache {
+    /// Create a new tasks cache instance by loading
+    /// the cache from disk.
+    pub(super) fn new(
+        cachefile_path: PathBuf,
+        cachefile_options: CreateOptions,
+        max_tasks_per_remote: usize,
+    ) -> Result<Self, Error> {
+        Ok(Self {
+            content: Self::load_content(&cachefile_path)?,
+            dirty: false,
+            cachefile_path,
+            cachefile_options,
+            max_tasks_per_remote,
+        })
+    }
+
+    /// Load the task cache contents from disk.
+    fn load_content(path: &Path) -> Result<TaskCacheContent, Error> {
+        let content = proxmox_sys::fs::file_read_optional_string(path)?;
+
+        let content = if let Some(content) = content {
+            serde_json::from_str(&content).unwrap_or_default()
+        } else {
+            Default::default()
+        };
+
+        Ok(content)
+    }
+
+    /// Get path for the cache's lockfile.
+    fn lockfile_path(&self) -> PathBuf {
+        let mut path = self.cachefile_path.clone();
+        path.set_extension("lock");
+        path
+    }
+
+    /// Persist the task cache
+    ///
+    /// This method requests an exclusive lock for the task cache lockfile.
+    pub(super) fn save(&mut self) -> Result<(), Error> {
+        // if we have not updated anything, we don't have to update the cache file
+        if !self.dirty {
+            return Ok(());
+        }
+
+        let _guard = self.lock(Duration::from_secs(5))?;
+
+        // Read content again, in case somebody has changed it in the meanwhile
+        let mut content = Self::load_content(&self.cachefile_path)?;
+
+        for (remote, entry) in self.content.remote_tasks.iter_mut() {
+            if let Some(other) = content.remote_tasks.remove(remote) {
+                entry.tasks =
+                    Self::merge_tasks(entry.tasks.clone(), other.tasks, self.max_tasks_per_remote);
+            }
+        }
+
+        let bytes = serde_json::to_vec_pretty(&self.content)?;
+
+        proxmox_sys::fs::replace_file(
+            &self.cachefile_path,
+            &bytes,
+            self.cachefile_options.clone(),
+            true,
+        )?;
+
+        self.dirty = false;
+
+        Ok(())
+    }
+
+    /// Add tasks to the cache.
+    ///
+    /// If the total number of stored tasks exceeds `max_tasks_per_remote`, the
+    /// oldest ones are truncated.
+    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
+        let entry = self.content.remote_tasks.entry(remote.into()).or_default();
+
+        entry.tasks = Self::merge_tasks(entry.tasks.clone(), tasks, self.max_tasks_per_remote);
+
+        self.dirty = true;
+    }
+
+    // Get task data for a given remote.
+    pub(super) fn get_tasks(&self, remote: &str) -> Option<Vec<TaskListItem>> {
+        self.content
+            .remote_tasks
+            .get(remote)
+            .map(|entry| entry.tasks.clone())
+    }
+
+    // Invalidate cache for a given remote.
+    pub(super) fn invalidate_cache_for_remote(&mut self, remote: &str) {
+        self.dirty = true;
+        self.content.remote_tasks.remove(remote);
+    }
+
+    // Lock the cache for modification.
+    //
+    // While the cache is locked, other users can still read the cache
+    // without a lock, since the cache file is replaced atomically
+    // when updating.
+    fn lock(&self, duration: Duration) -> Result<File, Error> {
+        proxmox_sys::fs::open_file_locked(
+            self.lockfile_path(),
+            duration,
+            true,
+            self.cachefile_options.clone(),
+        )
+    }
+
+    fn merge_tasks(
+        mut a: Vec<TaskListItem>,
+        mut b: Vec<TaskListItem>,
+        limit: usize,
+    ) -> Vec<TaskListItem> {
+        a.sort_by_key(|task| -task.starttime);
+        b.sort_by_key(|task| -task.starttime);
+
+        MergeTaskIterator {
+            left: a.into_iter().peekable(),
+            right: b.into_iter().peekable(),
+        }
+        .take(limit)
+        .collect()
+    }
+}
+
+struct MergeTaskIterator<T: Iterator<Item = TaskListItem>> {
+    left: Peekable<T>,
+    right: Peekable<T>,
+}
+
+impl<T> Iterator for MergeTaskIterator<T>
+where
+    T: Iterator<Item = TaskListItem>,
+{
+    type Item = T::Item;
+
+    fn next(&mut self) -> Option<T::Item> {
+        let order = match (self.left.peek(), self.right.peek()) {
+            (Some(l), Some(r)) => Some(l.starttime.cmp(&r.starttime)),
+            (Some(_), None) => Some(Ordering::Greater),
+            (None, Some(_)) => Some(Ordering::Less),
+            (None, None) => None,
+        };
+
+        match order {
+            Some(Ordering::Greater) => self.left.next(),
+            Some(Ordering::Less) => self.right.next(),
+            Some(Ordering::Equal) => {
+                // Both unwraps are safe, the former peek/match
+                // guarantees that there is an element.
+                let l = self.left.peek().unwrap();
+                let r = self.right.peek().unwrap();
+
+                // Dedup if both lists contain the same task
+                if l.upid == r.upid {
+                    // Take the one where the task is already finished
+                    if l.endtime.is_some() {
+                        let _ = self.right.next();
+                        self.left.next()
+                    } else {
+                        let _ = self.left.next();
+                        self.right.next()
+                    }
+                } else {
+                    self.left.next()
+                }
+            }
+            None => None,
+        }
+    }
+}
+
+#[derive(Default, Debug, Serialize, Deserialize)]
+/// Per-remote entry in the task cache.
+struct TaskCacheEntry {
+    tasks: Vec<TaskListItem>,
+}
+
+#[derive(Debug, Default, Serialize, Deserialize)]
+/// Content of the task cache file.
+struct TaskCacheContent {
+    remote_tasks: HashMap<String, TaskCacheEntry>,
+}
+
+#[cfg(test)]
+mod tests {
+    use pve_api_types::PveUpid;
+
+    use super::*;
+    use crate::test_support::temp::NamedTempFile;
+
+    fn make_upid(
+        starttime: i64,
+        endtime: Option<i64>,
+        status: Option<String>,
+    ) -> Result<TaskListItem, Error> {
+        let upid: PveUpid =
+            format!("UPID:pve-node:0000C530:001C9BEC:{starttime:08X}:stopall::root@pam:",)
+                .parse()?;
+
+        Ok(TaskListItem {
+            upid: upid.to_string(),
+            node: upid.node,
+            pid: upid.pid as i64,
+            pstart: upid.pstart,
+            starttime: upid.starttime,
+            worker_type: upid.worker_type,
+            worker_id: upid.worker_id,
+            user: upid.auth_id,
+            endtime,
+            status,
+        })
+    }
+
+    #[test]
+    fn basic_task_cache() -> Result<(), Error> {
+        let options = CreateOptions::new()
+            .owner(nix::unistd::Uid::effective())
+            .group(nix::unistd::Gid::effective())
+            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600));
+
+        let temp_file = NamedTempFile::new(options.clone())?;
+
+        let mut cache = TaskCache::new(temp_file.path().into(), options.clone(), 50)?;
+
+        let mut tasks = Vec::new();
+
+        let now = proxmox_time::epoch_i64();
+        for i in (0..20).rev() {
+            tasks.push(make_upid(now - 10 * i, None, None)?);
+        }
+
+        cache.add_tasks("some-remote", tasks.clone());
+        cache.save()?;
+
+        let cache = TaskCache::new(temp_file.path().into(), options, 50)?;
+
+        let res = cache.get_tasks("some-remote").unwrap();
+        tasks.reverse();
+        assert_eq!(tasks, res);
+
+        Ok(())
+    }
+
+    #[test]
+    fn merge_tasks() -> Result<(), Error> {
+        // Arrange
+        let mut a = Vec::new();
+        for i in [30, 10, 20] {
+            a.push(make_upid(i, None, None)?);
+        }
+
+        let mut b = Vec::new();
+        for i in [25, 15, 35, 5] {
+            b.push(make_upid(i, None, None)?);
+        }
+
+        a.push(make_upid(40, None, None)?);
+        b.push(make_upid(40, Some(50), Some("some status".into()))?);
+
+        // Act
+        let tasks = TaskCache::merge_tasks(a, b, 5);
+
+        // Assert
+        assert_eq!(tasks.len(), 5);
+        assert_eq!(tasks[0].starttime, 40);
+        assert_eq!(tasks[0].endtime, Some(50));
+        assert_eq!(tasks[1].starttime, 35);
+        assert_eq!(tasks[2].starttime, 30);
+        assert_eq!(tasks[3].starttime, 25);
+        assert_eq!(tasks[4].starttime, 20);
+
+        Ok(())
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (6 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/15] task cache: move to its own submodule Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-31 13:42   ` Wolfgang Bumiller
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/15] remote tasks: return tasks in stable order Lukas Wagner
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

Instead of purging the cache completely and requesting the entire task
history over the last seven days, we keep the per-remote task history in
the cache and only fetch missing tasks since the last time we fetched
them.

If a tracked task finishes, we also force a fetch of the most recent
task data, since that is currently the only way to get the task status
*and* the task endtime.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/mod.rs        | 52 +++++++++++++++------------
 server/src/remote_tasks/task_cache.rs | 25 +++++++++----
 2 files changed, 49 insertions(+), 28 deletions(-)

diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 4a0552c..16c46a5 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -20,6 +20,8 @@ use task_cache::TaskCache;
 
 // TODO: Does this number make sense?
 const CACHED_TASKS_PER_REMOTE: usize = 2000;
+const REFRESH_EVERY_S: i64 = 300;
+const OVERLAP_S: i64 = 60;
 
 /// Get tasks for all remotes
 // FIXME: filter for privileges
@@ -42,24 +44,32 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
     // a task's endtime, which is only returned by
     // /nodes/<node>/tasks...
     // Room for improvements in the future.
-    invalidate_cache_for_finished_tasks(&mut cache);
+    let force_refresh_remotes = get_remotes_with_finished_tasks();
 
+    let now = proxmox_time::epoch_i64();
     for (remote_name, remote) in &remotes.sections {
-        if let Some(tasks) = cache.get_tasks(remote_name.as_str()) {
-            // Data in cache is recent enough and has not been invalidated.
-            all_tasks.extend(tasks);
-        } else {
-            let tasks = match fetch_tasks(remote).await {
-                Ok(tasks) => tasks,
+        let last_fetched = cache.get_last_fetched(remote_name).unwrap_or(0);
+        let diff = now - last_fetched;
+
+        if diff > REFRESH_EVERY_S || diff < 0 || force_refresh_remotes.contains(remote_name) {
+            // Add some overlap so that we are sure to fetch every task - duplicates
+            // are removed when adding the tasks to the cache.
+            let fetch_since =
+                (cache.get_most_recent_starttime(remote_name).unwrap_or(0) - OVERLAP_S).max(0);
+
+            match fetch_tasks(remote, fetch_since).await {
+                Ok(tasks) => {
+                    cache.add_tasks(remote_name.as_str(), tasks, now);
+                }
                 Err(err) => {
                     // ignore errors for not reachable remotes
                     continue;
                 }
-            };
-            cache.add_tasks(remote_name.as_str(), tasks.clone());
-
-            all_tasks.extend(tasks);
+            }
         }
+
+        let tasks = cache.get_tasks(remote_name).unwrap_or_default();
+        all_tasks.extend(tasks);
     }
 
     let mut returned_tasks = add_running_tasks(all_tasks)?;
@@ -125,7 +135,7 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
 }
 
 /// Fetch tasks (active and finished) from a remote
-async fn fetch_tasks(remote: &Remote) -> Result<Vec<TaskListItem>, Error> {
+async fn fetch_tasks(remote: &Remote, since: i64) -> Result<Vec<TaskListItem>, Error> {
     let mut tasks = Vec::new();
 
     match remote.ty {
@@ -138,9 +148,8 @@ async fn fetch_tasks(remote: &Remote) -> Result<Vec<TaskListItem>, Error> {
                 let params = ListTasks {
                     // Include running tasks
                     source: Some(ListTasksSource::All),
-                    // TODO: How much task history do we want? Right now we just hard-code it
-                    // to 7 days.
-                    since: Some(proxmox_time::epoch_i64() - 7 * 24 * 60 * 60),
+                    since: Some(since),
+                    limit: Some(CACHED_TASKS_PER_REMOTE as u64),
                     ..Default::default()
                 };
 
@@ -182,18 +191,17 @@ fn map_tasks(tasks: Vec<ListTasksResponse>, remote: &str) -> Result<Vec<TaskList
     Ok(mapped)
 }
 
-/// Drops the cached task list of a remote for all finished tasks.
+/// Get list of remotes which have finished tasks.
 ///
 /// We use this to force a refresh so that we get the full task
 /// info (including `endtime`) in the next API call.
-fn invalidate_cache_for_finished_tasks(cache: &mut TaskCache) {
+fn get_remotes_with_finished_tasks() -> HashSet<String> {
     let mut finished = FINISHED_FOREIGN_TASKS.write().expect("mutex poisoned");
 
-    // If a task is finished, we force a refresh for the remote - otherwise
-    // we don't get the 'endtime' for the task.
-    for task in finished.drain() {
-        cache.invalidate_cache_for_remote(task.remote());
-    }
+    finished
+        .drain()
+        .map(|task| task.remote().to_string())
+        .collect()
 }
 
 /// Supplement the list of tasks that we received from the remote with
diff --git a/server/src/remote_tasks/task_cache.rs b/server/src/remote_tasks/task_cache.rs
index 8a98876..e08e3d4 100644
--- a/server/src/remote_tasks/task_cache.rs
+++ b/server/src/remote_tasks/task_cache.rs
@@ -107,10 +107,11 @@ impl TaskCache {
     ///
     /// If the total number of stored tasks exceeds `max_tasks_per_remote`, the
     /// oldest ones are truncated.
-    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
+    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>, last_fetched: i64) {
         let entry = self.content.remote_tasks.entry(remote.into()).or_default();
 
         entry.tasks = Self::merge_tasks(entry.tasks.clone(), tasks, self.max_tasks_per_remote);
+        entry.last_fetched = last_fetched;
 
         self.dirty = true;
     }
@@ -123,10 +124,21 @@ impl TaskCache {
             .map(|entry| entry.tasks.clone())
     }
 
-    // Invalidate cache for a given remote.
-    pub(super) fn invalidate_cache_for_remote(&mut self, remote: &str) {
-        self.dirty = true;
-        self.content.remote_tasks.remove(remote);
+    // Get the `last_fetched` time stamp for a given remote.
+    pub(super) fn get_last_fetched(&self, remote: &str) -> Option<i64> {
+        self.content
+            .remote_tasks
+            .get(remote)
+            .map(|entry| entry.last_fetched)
+    }
+
+    // Get the `most_recent_starttime` time stamp for a given remote.
+    pub(super) fn get_most_recent_starttime(&self, remote: &str) -> Option<i64> {
+        if let Some(entry) = self.content.remote_tasks.get(remote) {
+            Some(entry.tasks.first()?.starttime)
+        } else {
+            None
+        }
     }
 
     // Lock the cache for modification.
@@ -211,6 +223,7 @@ where
 /// Per-remote entry in the task cache.
 struct TaskCacheEntry {
     tasks: Vec<TaskListItem>,
+    last_fetched: i64,
 }
 
 #[derive(Debug, Default, Serialize, Deserialize)]
@@ -267,7 +280,7 @@ mod tests {
             tasks.push(make_upid(now - 10 * i, None, None)?);
         }
 
-        cache.add_tasks("some-remote", tasks.clone());
+        cache.add_tasks("some-remote", tasks.clone(), 0);
         cache.save()?;
 
         let cache = TaskCache::new(temp_file.path().into(), options, 50)?;
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 09/15] remote tasks: return tasks in stable order
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (7 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/15] remote tasks: allow to force-fetch latest tasks Lukas Wagner
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

Since the tasks are retrieved from the cache by key and this key comes
from iterating a hashmap, the order of tasks with the same start time
was unstable. Since we have to sort anyway, we can fix this by simply
using a task's UPID as a secondary sorting criterion. The UPID starts
with the remote's name, so this boils down to using the remote name as
a secondary criterion.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/mod.rs | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 16c46a5..3da6f25 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -1,4 +1,5 @@
 use std::{
+    cmp::Ordering,
     collections::HashSet,
     path::Path,
     sync::{LazyLock, RwLock},
@@ -73,7 +74,15 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
     }
 
     let mut returned_tasks = add_running_tasks(all_tasks)?;
-    returned_tasks.sort_by(|a, b| b.starttime.cmp(&a.starttime));
+
+    // Sort merged tasks by starttime, falling back to UPID if the starttime is equal.
+    // This UPID starts with the remote name, so this equates to using the remote name
+    // as a secondary sorting criterion.
+    returned_tasks.sort_by(|a, b| match b.starttime.cmp(&a.starttime) {
+        Ordering::Equal => a.upid.cmp(&b.upid),
+        ord => ord,
+    });
+
     let returned_tasks = returned_tasks
         .into_iter()
         .filter(|item| {
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 10/15] remote tasks: allow to force-fetch latest tasks
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (8 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/15] remote tasks: return tasks in stable order Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/15] fake remote: add missing fields to make the debug feature compile again Lukas Wagner
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This one makes sense for a manual refresh button in the remote
task view.
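
(The resulting call shape, taken from the diff below - passing `true`
bypasses the 5-minute refresh interval and always re-fetches:)

    let tasks = remote_tasks::get_tasks(filters, true).await?;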

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/remote_tasks.rs |  9 +++++++--
 server/src/remote_tasks/mod.rs | 13 ++++++++++---
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/server/src/api/remote_tasks.rs b/server/src/api/remote_tasks.rs
index 05ce366..c55e15a 100644
--- a/server/src/api/remote_tasks.rs
+++ b/server/src/api/remote_tasks.rs
@@ -24,13 +24,18 @@ const SUBDIRS: SubdirMap = &sorted!([("list", &Router::new().get(&API_METHOD_LIS
             filters: {
                 type: TaskFilters,
                 flatten: true,
+            },
+            "force-refresh": {
+                description: "Force fetching the latest task data from remotes",
+                optional: true,
+                default: false,
             }
         },
     },
 )]
 /// Get the list of tasks for all remotes.
-async fn list_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
-    let tasks = remote_tasks::get_tasks(filters).await?;
+async fn list_tasks(filters: TaskFilters, force_refresh: bool) -> Result<Vec<TaskListItem>, Error> {
+    let tasks = remote_tasks::get_tasks(filters, force_refresh).await?;
 
     Ok(tasks)
 }
diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 3da6f25..171c8aa 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -26,7 +26,10 @@ const OVERLAP_S: i64 = 60;
 
 /// Get tasks for all remotes
 // FIXME: filter for privileges
-pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error> {
+pub async fn get_tasks(
+    filters: TaskFilters,
+    force_refresh: bool,
+) -> Result<Vec<TaskListItem>, Error> {
     let (remotes, _) = pdm_config::remotes::config()?;
 
     let mut all_tasks = Vec::new();
@@ -45,14 +48,18 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
     // a task's endtime, which is only returned by
     // /nodes/<node>/tasks...
     // Room for improvements in the future.
-    let force_refresh_remotes = get_remotes_with_finished_tasks();
+    let remotes_with_finished_tasks = get_remotes_with_finished_tasks();
 
     let now = proxmox_time::epoch_i64();
     for (remote_name, remote) in &remotes.sections {
         let last_fetched = cache.get_last_fetched(remote_name).unwrap_or(0);
         let diff = now - last_fetched;
 
-        if diff > REFRESH_EVERY_S || diff < 0 || force_refresh_remotes.contains(remote_name) {
+        let data_too_old = diff > REFRESH_EVERY_S;
+        let clock_jumped_backwards = diff < 0;
+        let forced_by_finished_task = remotes_with_finished_tasks.contains(remote_name);
+
+        if data_too_old || clock_jumped_backwards || forced_by_finished_task || force_refresh {
             // Add some overlap so that we are sure to fetch every task - duplicates
             // are removed when adding the tasks to the cache.
             let fetch_since =
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 11/15] fake remote: add missing fields to make the debug feature compile again
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (9 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/15] remote tasks: allow to force-fetch latest tasks Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/15] fake remote: generate fake task data Lukas Wagner
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/test_support/fake_remote.rs | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/server/src/test_support/fake_remote.rs b/server/src/test_support/fake_remote.rs
index 374dffe..2419e69 100644
--- a/server/src/test_support/fake_remote.rs
+++ b/server/src/test_support/fake_remote.rs
@@ -40,6 +40,7 @@ impl RemoteConfig for FakeRemoteConfig {
                     nodes: Vec::new(),
                     authid: Authid::root_auth_id().clone(),
                     token: "".into(),
+                    web_url: None,
                 },
             );
 
@@ -85,6 +86,14 @@ impl ClientFactory for FakeClientFactory {
         }))
     }
 
+    fn make_pve_client_with_endpoint(
+        &self,
+        _remote: &Remote,
+        _target_endpoint: Option<&str>,
+    ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+        bail!("not implemented")
+    }
+
     fn make_pbs_client(&self, _remote: &Remote) -> Result<Box<PbsClient>, Error> {
         bail!("not implemented")
     }
@@ -148,6 +157,8 @@ impl PveClient for FakePveClient {
                 ty: ClusterResourceType::Qemu,
                 uptime: Some(1234),
                 vmid: Some(vmid),
+                lock: None,
+                tags: None,
             });
         }
 
@@ -179,6 +190,8 @@ impl PveClient for FakePveClient {
                 ty: ClusterResourceType::Lxc,
                 uptime: Some(1234),
                 vmid: Some(vmid),
+                lock: None,
+                tags: None,
             });
         }
 
@@ -209,6 +222,8 @@ impl PveClient for FakePveClient {
                 ty: ClusterResourceType::Node,
                 uptime: Some(1234),
                 vmid: Some(vmid),
+                lock: None,
+                tags: None,
             });
         }
 
@@ -239,6 +254,8 @@ impl PveClient for FakePveClient {
                 ty: ClusterResourceType::Storage,
                 uptime: None,
                 vmid: None,
+                lock: None,
+                tags: None,
             });
         }
 
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 12/15] fake remote: generate fake task data
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (10 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/15] fake remote: add missing fields to make the debug feature compile again Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/15] task cache: tests: improve test coverage Lukas Wagner
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

This helps us evaluate the performance characteristics
of huge setups.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/test_support/fake_remote.rs | 64 +++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/server/src/test_support/fake_remote.rs b/server/src/test_support/fake_remote.rs
index 2419e69..52584a0 100644
--- a/server/src/test_support/fake_remote.rs
+++ b/server/src/test_support/fake_remote.rs
@@ -6,8 +6,9 @@ use pdm_config::remotes::RemoteConfig;
 use proxmox_product_config::ApiLockGuard;
 use proxmox_section_config::typed::SectionConfigData;
 use pve_api_types::{
-    client::PveClient, ClusterMetrics, ClusterMetricsData, ClusterResource, ClusterResourceKind,
-    ClusterResourceType, StorageContent,
+    client::PveClient, ClusterMetrics, ClusterMetricsData, ClusterNodeIndexResponse,
+    ClusterNodeIndexResponseStatus, ClusterResource, ClusterResourceKind, ClusterResourceType,
+    ListTasks, ListTasksResponse, PveUpid, StorageContent,
 };
 use serde::Deserialize;
 
@@ -351,4 +352,63 @@ impl PveClient for FakePveClient {
 
         Ok(ClusterMetrics { data })
     }
+
+    async fn list_nodes(&self) -> Result<Vec<ClusterNodeIndexResponse>, proxmox_client::Error> {
+        Ok((0..self.nr_of_nodes)
+            .into_iter()
+            .map(|i| ClusterNodeIndexResponse {
+                node: format!("pve-{i}"),
+                cpu: None,
+                level: None,
+                maxcpu: None,
+                maxmem: None,
+                mem: None,
+                ssl_fingerprint: None,
+                status: ClusterNodeIndexResponseStatus::Online,
+                uptime: None,
+            })
+            .collect())
+    }
+
+    async fn get_task_list(
+        &self,
+        node: &str,
+        params: ListTasks,
+    ) -> Result<Vec<ListTasksResponse>, proxmox_client::Error> {
+        let make_task = |starttime| {
+            let endtime = Some(starttime + 4);
+
+            let upid_str =
+                format!("UPID:{node}:0000C530:001C9BEC:{starttime:08X}:stopall::root@pam:",);
+            let upid: PveUpid = upid_str.parse().unwrap();
+
+            ListTasksResponse {
+                node: node.to_string(),
+                endtime,
+                pid: upid.pid as i64,
+                pstart: upid.pstart as i64,
+                starttime,
+                status: Some("OK".to_string()),
+                ty: upid.worker_type,
+                user: upid.auth_id,
+                upid: upid_str,
+                id: upid.worker_id.unwrap_or_default(),
+            }
+        };
+
+        let limit = params.limit.unwrap_or(1500);
+        let since = params.since.unwrap_or(0);
+
+        let now = proxmox_time::epoch_i64();
+
+        // Let's fake a new task every 5 seconds
+        let number_of_tasks = (now - since) / 5;
+
+        let number_of_tasks = limit.min(number_of_tasks as u64);
+
+        Ok((0..number_of_tasks)
+            .into_iter()
+            .map(|i| make_task(now - i as i64 * 5))
+            .collect())
+    }
 }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 13/15] task cache: tests: improve test coverage
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (11 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/15] fake remote: generate fake task data Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/15] remote tasks: fix unused variable warning Lukas Wagner
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

Adds some further trivial checks to improve coverage.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/task_cache.rs | 40 +++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/server/src/remote_tasks/task_cache.rs b/server/src/remote_tasks/task_cache.rs
index e08e3d4..49164cb 100644
--- a/server/src/remote_tasks/task_cache.rs
+++ b/server/src/remote_tasks/task_cache.rs
@@ -289,6 +289,13 @@ mod tests {
         tasks.reverse();
         assert_eq!(tasks, res);
 
+        assert_eq!(cache.get_last_fetched("some-remote"), Some(0));
+        assert_eq!(cache.get_most_recent_starttime("some-remote").unwrap(), now);
+
+        assert!(cache.get_tasks("other-remote").is_none());
+        assert!(cache.get_last_fetched("other-remote").is_none());
+        assert!(cache.get_most_recent_starttime("other-remote").is_none());
+
         Ok(())
     }
 
@@ -322,4 +329,37 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn merge_tasks_2() -> Result<(), Error> {
+        // Pretty much identical to the other test, but tries
+        // to trigger different edge cases in the merge iterator.
+
+        // Arrange
+        let mut a = Vec::new();
+        for i in [30, 10, 20, 5] {
+            a.push(make_upid(i, None, None)?);
+        }
+
+        let mut b = Vec::new();
+        for i in [25, 15] {
+            b.push(make_upid(i, None, None)?);
+        }
+        a.push(make_upid(40, Some(50), Some("some status".into()))?);
+        b.push(make_upid(40, None, None)?);
+
+        // Act
+        let tasks = TaskCache::merge_tasks(a, b, 6);
+
+        // Assert
+        assert_eq!(tasks.len(), 6);
+        assert_eq!(tasks[0].starttime, 40);
+        assert_eq!(tasks[1].starttime, 30);
+        assert_eq!(tasks[2].starttime, 25);
+        assert_eq!(tasks[3].starttime, 20);
+        assert_eq!(tasks[4].starttime, 15);
+        assert_eq!(tasks[5].starttime, 10);
+
+        Ok(())
+    }
 }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 14/15] remote tasks: fix unused variable warning
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (12 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/15] task cache: tests: improve test coverage Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/15] remote-tasks: restrict function visibility Lukas Wagner
  2025-01-31  9:35 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/mod.rs | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 171c8aa..5735f2f 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -69,8 +69,10 @@ pub async fn get_tasks(
                 Ok(tasks) => {
                     cache.add_tasks(remote_name.as_str(), tasks, now);
                 }
-                Err(err) => {
-                    // ignore errors for not reachable remotes
+                Err(_err) => {
+                    // TODO: Log warning if a remote's tasks could not be fetched.
+                    // TODO: Add a mechanism to log not too often - we need that in other places as
+                    // well, so maybe come up with something generic.
                     continue;
                 }
             }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 15/15] remote-tasks: restrict function visibility
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (13 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/15] remote tasks: fix unused variable warning Lukas Wagner
@ 2025-01-28 12:25 ` Lukas Wagner
  2025-01-31  9:35 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
  15 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-28 12:25 UTC (permalink / raw)
  To: pdm-devel

These functions are never called from any other module, so we can
make them private.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/remote_tasks/mod.rs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
index 5735f2f..49789c9 100644
--- a/server/src/remote_tasks/mod.rs
+++ b/server/src/remote_tasks/mod.rs
@@ -331,7 +331,7 @@ pub fn track_running_task(task: RemoteUpid) -> Option<JoinHandle<()>> {
 /// Get a list of running foreign tasks
 ///
 /// panics on a poisoned mutex
-pub fn get_running_tasks() -> Vec<RemoteUpid> {
+fn get_running_tasks() -> Vec<RemoteUpid> {
     RUNNING_FOREIGN_TASKS
         .read()
         .unwrap()
@@ -344,7 +344,7 @@ pub fn get_running_tasks() -> Vec<RemoteUpid> {
 /// returns their upids + status
 ///
 /// panics on a poisoned mutex
-pub async fn get_finished_tasks() -> Vec<(RemoteUpid, String)> {
+async fn get_finished_tasks() -> Vec<(RemoteUpid, String)> {
     let mut finished = Vec::new();
     let config = match pdm_config::remotes::config() {
         Ok((config, _)) => config,
@@ -396,7 +396,7 @@ pub async fn get_finished_tasks() -> Vec<(RemoteUpid, String)> {
 }
 
 /// Parses a task status string into a TaskStateType
-pub fn tasktype(status: &str) -> TaskStateType {
+fn tasktype(status: &str) -> TaskStateType {
     if status == "unknown" || status.is_empty() {
         TaskStateType::Unknown
     } else if status == "OK" {
-- 
2.39.5




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism Lukas Wagner
@ 2025-01-29 18:27   ` Thomas Lamprecht
  2025-01-30  8:01     ` Lukas Wagner
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Lamprecht @ 2025-01-29 18:27 UTC (permalink / raw)
  To: Proxmox Datacenter Manager development discussion, Lukas Wagner

On 28.01.25 at 13:25, Lukas Wagner wrote:
> This commit removes the time-based caching policy for remote tasks. It
> will be replaced by another cache replacement policy based on total
> number of tasks in an upcoming commit.

high-level: Such commits really should state a rationale with some
background on why this approach has to be replaced. Noting that in
the cover letter too would also be appreciated; such things help to
"sell" series/PRs, and having the underlying goal and/or pain points
spelled out, even if quite obvious, ensures everyone is on the same
page.

Similar comment for the next patch adding the FIFO replacement policy,
I won't write a separate mail for that.

I will try to check out the overall picture and the points up for
discussion you mention over the next few days; that should not discourage
anybody else from reviewing it though, and please holler at me should I
need longer.



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism
  2025-01-29 18:27   ` Thomas Lamprecht
@ 2025-01-30  8:01     ` Lukas Wagner
  2025-01-30 16:06       ` Thomas Lamprecht
  0 siblings, 1 reply; 26+ messages in thread
From: Lukas Wagner @ 2025-01-30  8:01 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Datacenter Manager development discussion



On  2025-01-29 19:27, Thomas Lamprecht wrote:
> On 28.01.25 at 13:25, Lukas Wagner wrote:
>> This commit removes the time-based caching policy for remote tasks. It
>> will be replaced by another cache replacement policy based on total
>> number of tasks in an upcoming commit.
> 
> high-level: Such commits really should state a rationale with some
> background on why this approach has to be replaced. Noting that in
> the cover letter too would also be appreciated; such things help to
> "sell" series/PRs, and having the underlying goal and/or pain points
> spelled out, even if quite obvious, ensures everyone is on the same
> page.
> 
> Similar comment for the next patch adding the FIFO replacement policy,
> I won't write a separate mail for that.
> 

Ah yeah sure, sorry about that.

Dominik (and Dietmar briefly as well) suggested this approach to me, and after some thought
I agreed this was better. Since this came from 'higher up', the 'why' was somewhat settled,
at least in my head, and I guess that's why I kinda forgot to explain it in the commit message.
Of course you are 100% correct, the rationale should be included in the commit log for future reference.
I'll wait for further reviews and then try to expand the commit messages for a v2, if that's alright with you.

Thanks!

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age mechanism
  2025-01-30  8:01     ` Lukas Wagner
@ 2025-01-30 16:06       ` Thomas Lamprecht
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Lamprecht @ 2025-01-30 16:06 UTC (permalink / raw)
  To: Proxmox Datacenter Manager development discussion, Lukas Wagner

On 30.01.25 at 09:01, Lukas Wagner wrote:
> 
> 
> On  2025-01-29 19:27, Thomas Lamprecht wrote:
>> On 28.01.25 at 13:25, Lukas Wagner wrote:
>>> This commit removes the time-based caching policy for remote tasks. It
>>> will be replaced by another cache replacement policy based on total
>>> number of tasks in an upcoming commit.
>>
>> high-level: Such commits really should state a rationale with some
>> background on why this approach has to be replaced. Noting that in
>> the cover letter too would also be appreciated; such things help to
>> "sell" series/PRs, and having the underlying goal and/or pain points
>> spelled out, even if quite obvious, ensures everyone is on the same
>> page.
>>
>> Similar comment for the next patch adding the FIFO replacement policy,
>> I won't write a separate mail for that.
>>
> 
> Ah yeah sure, sorry about that.
> 
> Dominik (and Dietmar briefly as well) suggested this approach to me, and after some thought
> I agreed this was better. Since this came from 'higher up', the 'why' was somewhat settled,
> at least in my head, and I guess that's why I kinda forgot to explain it in the commit message.
> Of course you are 100% correct, the rationale should be included in the commit log for future reference.
> I'll wait for further reviews and then try to expand the commit messages for a v2, if that's alright with you.

Maybe reply with some rationale to the cover-letter now or the like?
I had a quick chat with Wolfgang and the "why?" question popped up there
almost immediately. Tbh., currently, without looking all too deep and
thinking all too much about this (so good chance of oversight), I only
see one benefit: having a fixed, guaranteed limit on cache size (scaling
with remote count?). And I mean sure, if this were big data - but
it essentially should be just UPID + optional endtime and status.

So if size is an issue, it might make more sense to use a type that does
not duplicate as much as TaskListItem does; the UPID alone already
contains most of that information. The frontend could even split that,
it's rather trivial, but even doing that on demand in the backend would
be OK. As this is a cache and basically fully internal, using a binary
format might also be an option to consider - if size is the real reason
for this here, that is.
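
(To illustrate the point - a rough sketch only, type and field names
assumed, not a concrete proposal:)

    use serde::{Deserialize, Serialize};

    // hypothetical compact cache entry: node, worker type, user, starttime
    // and so on can all be re-parsed from the UPID string on demand
    #[derive(Serialize, Deserialize)]
    struct CompactTaskEntry {
        upid: String,           // remote-qualified UPID
        endtime: Option<i64>,   // only known once the task has finished
        status: Option<String>, // e.g. "OK"
    }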



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
                   ` (14 preceding siblings ...)
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/15] remote-tasks: restrict function visibility Lukas Wagner
@ 2025-01-31  9:35 ` Lukas Wagner
  2025-01-31 13:36   ` Wolfgang Bumiller
  2025-02-05 15:34   ` Thomas Lamprecht
  15 siblings, 2 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-01-31  9:35 UTC (permalink / raw)
  To: pdm-devel; +Cc: Thomas Lamprecht



On  2025-01-28 13:25, Lukas Wagner wrote:
> This patch series changes the remote task caching behavior from being purely
> time-based cache to a FIFO cache-replacement policy with a maximum number of
> cached tasks per remote. If the maximum number is exceeded, the oldest tasks
> are dropped from the cache.
> 
> When calling the remote-tasks API, the latest missing task data which is not
> yet in the cache is requested from the remotes. There we limit this to once
> every 5 minutes at the moment, with the option for a force-refresh (to be triggered
> by a refresh button in the UI). As before, we augment the cached task data
> with the currently running tracked tasks which were started by PDM.
> 
> Some words about the cache storage implementation:
> Note that the storage backend for this cache probably needs some more love in
> the future. Right now its just a single JSON file for everything, mainly because this
> was the quickest approach to implement to unblock UI development work. 
> The problem with the approach is that it does not perform well with setups
> with a large number of remotes, since every update to the cache rewrites
> the entire cache file when the cache is persisted, causing additional
> CPU and IO load.
> 
> In the future, we should use a similar mechanism as the task archive in PBS.
> I'm not sure if the exact same mechanism can be used due to some different
> requirements, but the general direct probably fits quite well.
> If we can reuse it 1:1, we have to break it out of (iirc) the WorkerTask struct
> to be reusable.
> It will probably require some experimentation and benchmarking to find an
> ideal approach.
> We probably don't want a separate archive per remote, since we do not want to
> read hundres/thousands of files when we request an aggregated remote task history
> via the API. But having the complete archive in a single file also seems quite
> challenging - we need to keep the data sorted, while we also need to
> handle task data arriving out of order from different remotes. Keeping
> the data sorted when new data arrives leads us to the same problem as with
> JSON file, being that we have to rewrite the file over and over again, causing
> load and writes.
> 
> The good news is that since this is just a cache, we are pretty free to change
> the storage implementation without too much trouble; we don't even have to
> migrate the old data, since it should not be an issue to simply request the
> data from the remotes again. This is the main reason why I've opted
> to keep the JSON file for now; I or somebody else can revisit this at a later
> time.
> 

Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:

The status quo for the task cache is to fetch a certain time-based range of tasks
(iirc the last seven days) from every remote and cache this for a certain period
of time (max-age). If the cached data is too old, we discard the task data
and fetch the same time range again.
My initial reasoning behind designing it like this was to keep the
'ground truth' completely on the remote side, so *if* somebody were to mess with
the task archive, we would be consistent after a refresh. Also, this allowed us
to keep the caching logic on the PDM side much simpler, since we were not doing
much more than caching the API response from the remote - the same way we
already do it for resources, subscription status, etc.

The downside is that we had some unnecessary traffic, since we kept fetching old
tasks that we had already received.

I originally posted this as an RFC right before the holidays to get some feedback
on the initial approach (I wasn't too sure myself whether it was a good idea) and
also to somewhat unblock UI development for the remote task view.

The RFC patches were applied relatively quickly; in a short conversation with Dietmar
he mentioned that it might be better to just fetch the most recent missing tasks so
that we don't have to retransmit the old data over and over again.

Some time later, Dominik also approached me and suggested the approach that is implemented in
this patch series, which is:
  - instead of simply caching the remote's API response, we try to replicate
    the task archive locally
  - memorize the time when we last got the latest tasks and only request
    what is missing since then
  - limit the replicated task archive's size, dropping the oldest tasks
    when the size is exceeded

I didn't have any objections, so I went ahead and implemented it.

The main benefit is that we have to transmit much less data.
The drawback is that the caching logic became more complex, since we e.g. have to make sure
that we don't have duplicated entries in the task logs. Also, at least in *theory*,
both could diverge if something were to change the task archive on the remote
side - not sure how much of a concern this is, though.
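
(The dedup rule itself boils down to the following - a sketch of the
invariant, mirroring the merge iterator from this series; TaskListItem
as in pdm-api-types:)

    use pdm_api_types::TaskListItem;

    // of two entries with the same UPID, prefer the one that already
    // carries the final state (endtime + status), i.e. the finished one
    fn prefer_finished(a: TaskListItem, b: TaskListItem) -> TaskListItem {
        if a.endtime.is_some() { a } else { b }
    }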

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-01-31  9:35 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
@ 2025-01-31 13:36   ` Wolfgang Bumiller
  2025-01-31 13:51     ` Wolfgang Bumiller
  2025-02-05 15:34   ` Thomas Lamprecht
  1 sibling, 1 reply; 26+ messages in thread
From: Wolfgang Bumiller @ 2025-01-31 13:36 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel, Thomas Lamprecht

On Fri, Jan 31, 2025 at 10:35:03AM +0100, Lukas Wagner wrote:
> 
> 
> On  2025-01-28 13:25, Lukas Wagner wrote:
> > This patch series changes the remote task caching behavior from being purely
> > time-based cache to a FIFO cache-replacement policy with a maximum number of
> > cached tasks per remote. If the maximum number is exceeded, the oldest tasks
> > are dropped from the cache.
> > 
> > When calling the remote-tasks API, the latest missing task data which is not
> > yet in the cache is requested from the remotes. There we limit this to once
> > every 5 minutes at the moment, with the option for a force-refresh (to be triggered
> > by a refresh button in the UI). As before, we augment the cached task data
> > with the currently running tracked tasks which were started by PDM.
> > 
> > Some words about the cache storage implementation:
> > Note that the storage backend for this cache probably needs some more love in
> > the future. Right now its just a single JSON file for everything, mainly because this
> > was the quickest approach to implement to unblock UI development work. 
> > The problem with the approach is that it does not perform well with setups
> > with a large number of remotes, since every update to the cache rewrites
> > the entire cache file when the cache is persisted, causing additional
> > CPU and IO load.
> > 
> > In the future, we should use a similar mechanism as the task archive in PBS.
> > I'm not sure if the exact same mechanism can be used due to some different
> > requirements, but the general direct probably fits quite well.
> > If we can reuse it 1:1, we have to break it out of (iirc) the WorkerTask struct
> > to be reusable.
> > It will probably require some experimentation and benchmarking to find an
> > ideal approach.
> > We probably don't want a separate archive per remote, since we do not want to
> > read hundres/thousands of files when we request an aggregated remote task history
> > via the API. But having the complete archive in a single file also seems quite
> > challenging - we need to keep the data sorted, while we also need to
> > handle task data arriving out of order from different remotes. Keeping
> > the data sorted when new data arrives leads us to the same problem as with
> > JSON file, being that we have to rewrite the file over and over again, causing
> > load and writes.
> > 
> > The good news is that since this is just a cache, we are pretty free to change
> > the storage implementation without too much trouble; we don't even have to
> > migrate the old data, since it should not be an issue to simply request the
> > data from the remotes again. This is the main reason why I've opted
> > to keep the JSON file for now; I or somebody else can revisit this at a later
> > time.
> > 
> 
> Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:
> 
> The status quo for the task cache is to fetch a certain time-based range of tasks
> (iirc the last seven days) from every remote and cache this for a certain period
> of time (max-age). If the cached data is too old, we discard the task data
> and fetch the same time range again.
> My initial reasoning behind designing it like this was to keep the
> 'ground truth' completely on the remote side, so *if* somebody were to mess with

I don't think "messing with the task archive" is something we want to
worry about unless it's "easy enough".

> the task archive, we would be consistent after a refresh. Also, this allowed us
> to keep the caching logic on the PDM side much simpler, since we were not doing
> much more than caching the API response from the remote - the same way we
> already do it for resources, subscription status, etc.
> 
> The downside is that we had some unnecessary traffic, since we kept fetching old
> tasks that we had already received.
> 
> I originally posted this as an RFC right before the holidays to get some feedback
> on the initial approach (I wasn't too sure myself whether it was a good idea) and
> also to somewhat unblock UI development for the remote task view.

There are definitely some things which need to be improved regardless of
which version we use.
This version otherwise does look okay to me, I suppose.

> 
> The RFC patches were applied relatively quickly; in a short conversation with Dietmar
> he mentioned that it might be better to just fetch the most recent missing tasks so
> that we don't have to retransmit the old data over and over again.
> 
> Some time later, Dominik also approached me and suggested the approach that is implemented in
> this patch series, which is:
>   - instead of simply caching the remote's API response, we try to replicate
>     the task archive locally
>   - memorize the time when we last got the latest tasks and only request
>     what is missing since then
>   - limit the replicated task archive's size, dropping the oldest tasks
>     when the size is exceeded
> 
> I didn't have any objections, so I went ahead and implemented it.
> 
> The main benefit is that we have to transmit much less data.
> The drawback is that the caching logic became more complex, since we e.g. have to make sure
> that we don't have duplicated entries in the task logs. Also, at least in *theory*,
> both could diverge if something were to change the task archive on the remote
> side - not sure how much of a concern this is, though.



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks
  2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks Lukas Wagner
@ 2025-01-31 13:42   ` Wolfgang Bumiller
  0 siblings, 0 replies; 26+ messages in thread
From: Wolfgang Bumiller @ 2025-01-31 13:42 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Jan 28, 2025 at 01:25:13PM +0100, Lukas Wagner wrote:
> Instead of purging the cache completely and requesting the entire task
> history over the last seven days, we keep the per-remote task history in
> the cache and only fetch missing tasks since the last time we fetched
> them.
> 
> If a tracked task finishes, we also force a fetch of the most recent
> task data, since that is currently the only way to get the task status
> *and* the task endtime.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  server/src/remote_tasks/mod.rs        | 52 +++++++++++++++------------
>  server/src/remote_tasks/task_cache.rs | 25 +++++++++----
>  2 files changed, 49 insertions(+), 28 deletions(-)
> 
> diff --git a/server/src/remote_tasks/mod.rs b/server/src/remote_tasks/mod.rs
> index 4a0552c..16c46a5 100644
> --- a/server/src/remote_tasks/mod.rs
> +++ b/server/src/remote_tasks/mod.rs
> @@ -20,6 +20,8 @@ use task_cache::TaskCache;
>  
>  // TODO: Does this number make sense?
>  const CACHED_TASKS_PER_REMOTE: usize = 2000;
> +const REFRESH_EVERY_S: i64 = 300;
> +const OVERLAP_S: i64 = 60;
>  
>  /// Get tasks for all remotes
>  // FIXME: filter for privileges
> @@ -42,24 +44,32 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
>      // a task's endtime, which is only returned by
>      // /nodes/<node>/tasks...
>      // Room for improvements in the future.
> -    invalidate_cache_for_finished_tasks(&mut cache);
> +    let force_refresh_remotes = get_remotes_with_finished_tasks();
>  
> +    let now = proxmox_time::epoch_i64();
>      for (remote_name, remote) in &remotes.sections {
> -        if let Some(tasks) = cache.get_tasks(remote_name.as_str()) {
> -            // Data in cache is recent enough and has not been invalidated.
> -            all_tasks.extend(tasks);
> -        } else {
> -            let tasks = match fetch_tasks(remote).await {
> -                Ok(tasks) => tasks,
> +        let last_fetched = cache.get_last_fetched(remote_name).unwrap_or(0);
> +        let diff = now - last_fetched;
> +
> +        if diff > REFRESH_EVERY_S || diff < 0 || force_refresh_remotes.contains(remote_name) {
> +            // Add some overlap so that we are sure to fetch every task - duplicates
> +            // are removed when adding the tasks to the cache.
> +            let fetch_since =
> +                (cache.get_most_recent_starttime(remote_name).unwrap_or(0) - OVERLAP_S).max(0);
> +
> +            match fetch_tasks(remote, fetch_since).await {
> +                Ok(tasks) => {
> +                    cache.add_tasks(remote_name.as_str(), tasks, now);
> +                }
>                  Err(err) => {
>                      // ignore errors for not reachable remotes
>                      continue;
>                  }
> -            };
> -            cache.add_tasks(remote_name.as_str(), tasks.clone());
> -
> -            all_tasks.extend(tasks);
> +            }
>          }
> +
> +        let tasks = cache.get_tasks(remote_name).unwrap_or_default();
> +        all_tasks.extend(tasks);
>      }
>  
>      let mut returned_tasks = add_running_tasks(all_tasks)?;
> @@ -125,7 +135,7 @@ pub async fn get_tasks(filters: TaskFilters) -> Result<Vec<TaskListItem>, Error>
>  }
>  
>  /// Fetch tasks (active and finished) from a remote
> -async fn fetch_tasks(remote: &Remote) -> Result<Vec<TaskListItem>, Error> {
> +async fn fetch_tasks(remote: &Remote, since: i64) -> Result<Vec<TaskListItem>, Error> {
>      let mut tasks = Vec::new();
>  
>      match remote.ty {
> @@ -138,9 +148,8 @@ async fn fetch_tasks(remote: &Remote) -> Result<Vec<TaskListItem>, Error> {
>                  let params = ListTasks {
>                      // Include running tasks
>                      source: Some(ListTasksSource::All),
> -                    // TODO: How much task history do we want? Right now we just hard-code it
> -                    // to 7 days.
> -                    since: Some(proxmox_time::epoch_i64() - 7 * 24 * 60 * 60),
> +                    since: Some(since),
> +                    limit: Some(CACHED_TASKS_PER_REMOTE as u64),
>                      ..Default::default()
>                  };
>  
> @@ -182,18 +191,17 @@ fn map_tasks(tasks: Vec<ListTasksResponse>, remote: &str) -> Result<Vec<TaskList
>      Ok(mapped)
>  }
>  
> -/// Drops the cached task list of a remote for all finished tasks.
> +/// Get list of remotes which have finished tasks.
>  ///
>  /// We use this to force a refresh so that we get the full task
>  /// info (including `endtime`) in the next API call.
> -fn invalidate_cache_for_finished_tasks(cache: &mut TaskCache) {
> +fn get_remotes_with_finished_tasks() -> HashSet<String> {
>      let mut finished = FINISHED_FOREIGN_TASKS.write().expect("mutex poisoned");

Already mentioned this off list (since I had also looked at the old code
for the first time just this week):
This could just use
    let finished = std::mem::take(&mut *FINISHED_...lock().unwrap());
to immediately unlock the mutex and use `into_iter()` instead of
`drain()`, which should be more efficient.
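
Spelled out - a minimal sketch, assuming FINISHED_FOREIGN_TASKS stays the
RwLock<HashSet<..>> used in this patch:

    fn get_remotes_with_finished_tasks() -> HashSet<String> {
        // take() swaps an empty set into place; the temporary write guard
        // is dropped at the end of the statement, so the lock is released
        // before the mapping/collecting below
        let finished =
            std::mem::take(&mut *FINISHED_FOREIGN_TASKS.write().expect("mutex poisoned"));

        finished
            .into_iter()
            .map(|task| task.remote().to_string())
            .collect()
    }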

>  
> -    // If a task is finished, we force a refresh for the remote - otherwise
> -    // we don't get the 'endtime' for the task.
> -    for task in finished.drain() {
> -        cache.invalidate_cache_for_remote(task.remote());
> -    }
> +    finished
> +        .drain()
> +        .map(|task| task.remote().to_string())
> +        .collect()
>  }
>  
>  /// Supplement the list of tasks that we received from the remote with
> diff --git a/server/src/remote_tasks/task_cache.rs b/server/src/remote_tasks/task_cache.rs
> index 8a98876..e08e3d4 100644
> --- a/server/src/remote_tasks/task_cache.rs
> +++ b/server/src/remote_tasks/task_cache.rs
> @@ -107,10 +107,11 @@ impl TaskCache {
>      ///
>      /// If the total number of stored tasks exceeds `max_tasks_per_remote`, the
>      /// oldest ones are truncated.
> -    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>) {
> +    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>, last_fetched: i64) {
>          let entry = self.content.remote_tasks.entry(remote.into()).or_default();
>  
>          entry.tasks = Self::merge_tasks(entry.tasks.clone(), tasks, self.max_tasks_per_remote);

`entry.tasks = something(entry.tasks.clone())` seems wasteful; this is
also solved by `take()`:

    entry.tasks = Self::merge_tasks(take(&mut entry.tasks), ...);
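
Spelled out against the hunk above, the method would then read roughly
(a sketch, not compile-tested):

    use std::mem::take;

    pub(super) fn add_tasks(&mut self, remote: &str, tasks: Vec<TaskListItem>, last_fetched: i64) {
        let entry = self.content.remote_tasks.entry(remote.into()).or_default();
        // take() leaves an empty Vec behind instead of cloning the old list
        entry.tasks = Self::merge_tasks(take(&mut entry.tasks), tasks, self.max_tasks_per_remote);
        entry.last_fetched = last_fetched;
        self.dirty = true;
    }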

> +        entry.last_fetched = last_fetched;
>  
>          self.dirty = true;
>      }
> @@ -123,10 +124,21 @@ impl TaskCache {
>              .map(|entry| entry.tasks.clone())
>      }
>  
> -    // Invalidate cache for a given remote.
> -    pub(super) fn invalidate_cache_for_remote(&mut self, remote: &str) {
> -        self.dirty = true;
> -        self.content.remote_tasks.remove(remote);
> +    // Get the `last_fetched` time stamp for a given remote.
> +    pub(super) fn get_last_fetched(&self, remote: &str) -> Option<i64> {
> +        self.content
> +            .remote_tasks
> +            .get(remote)
> +            .map(|entry| entry.last_fetched)
> +    }
> +
> +    // Get the `most_recent_starttime` time stamp for a given remote.
> +    pub(super) fn get_most_recent_starttime(&self, remote: &str) -> Option<i64> {
> +        if let Some(entry) = self.content.remote_tasks.get(remote) {
> +            Some(entry.tasks.first()?.starttime)
> +        } else {
> +            None
> +        }
>      }
>  
>      // Lock the cache for modification.
> @@ -211,6 +223,7 @@ where
>  /// Per-remote entry in the task cache.
>  struct TaskCacheEntry {
>      tasks: Vec<TaskListItem>,
> +    last_fetched: i64,
>  }
>  
>  #[derive(Debug, Default, Serialize, Deserialize)]
> @@ -267,7 +280,7 @@ mod tests {
>              tasks.push(make_upid(now - 10 * i, None, None)?);
>          }
>  
> -        cache.add_tasks("some-remote", tasks.clone());
> +        cache.add_tasks("some-remote", tasks.clone(), 0);
>          cache.save()?;
>  
>          let cache = TaskCache::new(temp_file.path().into(), options, 50)?;
> -- 
> 2.39.5



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-01-31 13:36   ` Wolfgang Bumiller
@ 2025-01-31 13:51     ` Wolfgang Bumiller
  0 siblings, 0 replies; 26+ messages in thread
From: Wolfgang Bumiller @ 2025-01-31 13:51 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel, Thomas Lamprecht

On Fri, Jan 31, 2025 at 02:36:07PM +0100, Wolfgang Bumiller wrote:
> On Fri, Jan 31, 2025 at 10:35:03AM +0100, Lukas Wagner wrote:
> > 
> > 
> > On  2025-01-28 13:25, Lukas Wagner wrote:
> > > This patch series changes the remote task caching behavior from being purely
> > > time-based cache to a FIFO cache-replacement policy with a maximum number of
> > > cached tasks per remote. If the maximum number is exceeded, the oldest tasks
> > > are dropped from the cache.
> > > 
> > > When calling the remote-tasks API, the latest missing task data which is not
> > > yet in the cache is requested from the remotes. There we limit this to once
> > > every 5 minutes at the moment, with the option for a force-refresh (to be triggered
> > > by a refresh button in the UI). As before, we augment the cached task data
> > > with the currently running tracked tasks which were started by PDM.
> > > 
> > > Some words about the cache storage implementation:
> > > Note that the storage backend for this cache probably needs some more love in
> > > the future. Right now its just a single JSON file for everything, mainly because this
> > > was the quickest approach to implement to unblock UI development work. 
> > > The problem with the approach is that it does not perform well with setups
> > > with a large number of remotes, since every update to the cache rewrites
> > > the entire cache file when the cache is persisted, causing additional
> > > CPU and IO load.
> > > 
> > > In the future, we should use a similar mechanism as the task archive in PBS.
> > > I'm not sure if the exact same mechanism can be used due to some different
> > > requirements, but the general direct probably fits quite well.
> > > If we can reuse it 1:1, we have to break it out of (iirc) the WorkerTask struct
> > > to be reusable.
> > > It will probably require some experimentation and benchmarking to find an
> > > ideal approach.
> > > We probably don't want a separate archive per remote, since we do not want to
> > > read hundres/thousands of files when we request an aggregated remote task history
> > > via the API. But having the complete archive in a single file also seems quite
> > > challenging - we need to keep the data sorted, while we also need to
> > > handle task data arriving out of order from different remotes. Keeping
> > > the data sorted when new data arrives leads us to the same problem as with
> > > JSON file, being that we have to rewrite the file over and over again, causing
> > > load and writes.
> > > 
> > > The good news is that since this is just a cache, we are pretty free to change
> > > the storage implementation without too much trouble; we don't even have to
> > > migrate the old data, since it should not be an issue to simply request the
> > > data from the remotes again. This is the main reason why I've opted
> > > to keep the JSON file for now; I or somebody else can revisit this at a later
> > > time.
> > > 
> > 
> > Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:
> > 
> > The status quo for the task cache is to fetch a certain time-based range of tasks
> > (iirc the last seven days) from every remote and cache this for a certain period
> > of time (max-age). If the cached data is too old, we discard the task data
> > and fetch the same time range again.
> > My initial reasoning behind designing it like this was to keep the
> > 'ground truth' completely on the remote side, so *if* somebody were to mess with
> 
> I don't think "messing with the task archive" is something we want to
> worry about unless it's "easy enough".
> 
> > the task archive, we would be consistent after a refresh. Also this allowed us to
> > keep the caching logic on PDM side much simpler, since we are not doing much more
> > than caching the API response from the remote - the same way we already do it
> > for resources, subscription status, etc.
> > 
> > The downside is we had some unnecessary traffic, since we keep on fetching old tasks
> > that we already received.
> > 
> > I originally posted this as an RFC right before the holidays to get an initial approach
> > out and gather some feedback on it (I wasn't too sure myself if the original approach
> > was a good idea) and also to somewhat unblock UI development for the remote task view.
> 
> There are definitely some things which need to be improved regardless of
> which version we use.

Since this does not fit into the current patches as it is already an
issue in the original code:

Also mentioned this off list, but this is something general to remember:

Whenever we `spawn()` a longer running task we also need to take into
account the possibility that the daemons might be reloaded, where these
tasks would prevent the original one from shutting down (and potentially
race against file locks in the reloaded one), so they should `select()`
on a `proxmox_daemon::shutdown_future()`.

For this case, this means the list of tasks which are currently being
polled in futures needs to be persisted somewhere so the reloaded daemon
can pick it up and continue the polling while the polling task in the
old daemon just gets cancelled. AFAICT this should be fairly
unproblematic here.
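
As a rough illustration of that pattern (poll_remote_tasks() is a
hypothetical placeholder for the polling future; this assumes a tokio
runtime and that proxmox_daemon::shutdown_future() resolves once a
shutdown or reload is requested):

    tokio::spawn(async move {
        tokio::select! {
            // daemon is reloading/stopping: cancel polling so the old
            // process can exit and does not race the new one for locks
            _ = proxmox_daemon::shutdown_future() => {}
            result = poll_remote_tasks() => {
                if let Err(err) = result {
                    log::error!("remote task polling failed: {}", err);
                }
            }
        }
    });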



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-01-31  9:35 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
  2025-01-31 13:36   ` Wolfgang Bumiller
@ 2025-02-05 15:34   ` Thomas Lamprecht
  2025-02-06 10:13     ` Lukas Wagner
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Lamprecht @ 2025-02-05 15:34 UTC (permalink / raw)
  To: Proxmox Datacenter Manager development discussion

On 31.01.25 at 10:35, Lukas Wagner wrote:
> Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:
> 
> The status quo for the task cache is to fetch a certain time-based range of tasks
> (iirc the last seven days) from every remote and cache this for a certain period
> of time (max-age). If the cached data is too old, we discard the task data
> and fetch the same time range again.
> My initial reasoning behind designing it like this was to keep the
> 'ground truth' completely on the remote side, so *if* somebody were to mess with

Yeah, I agree with Wolfgang here, nothing in our API allows messing with that,
neither does a first-class PVE CLI command or the like. If we want to hedge
against that we basically cannot cache anything at all, rendering them rather
useless, as even RRD metrics could be altered.

Using the start-time of the newest available task from the cache as an (inclusive)
boundary to request newer tasks that happened since the last query would be
enough, no?

Pruning older tasks from the cache once some max-time or max-size limit
is overstepped would then make sense though.

IMO, for purging older task log entries to avoid a huge cache we should mainly
focus on an age-of-entries limit, e.g. most of our dashboards will focus on showing
the task results since X hours, where a sensible minimum for X is probably at
least 24 hours, though better more like 3 to 5 days, to be able to (quickly)
see what happened over a weekend, maybe with some holiday attached to it.

A generously high size limit as an additional upper bound to avoid running out of
space might still be nice to hedge against some node, user or api-tooling going
crazy and producing a huge amount of tasks.

> the task archive, we would be consistent after a refresh. Also this allowed us to
> keep the caching logic on PDM side much simpler, since we are not doing much more
> than caching the API response from the remote - the same way we already do it
> for resources, subscription status, etc.
> 
> The downside is we had some unnecessary traffic, since we keep on fetching old tasks
> that we already received.
> 
> I originally posted this as an RFC right before the holidays to get an initial approach
> out and gather some feedback on it (I wasn't too sure myself if the original approach
> was a good idea) and also to somewhat unblock UI development for the remote task view.
>
> The RFC patches were applied relatively quickly; in a short conversation with Dietmar
> he mentioned that it might be better to just fetch the most recent missing tasks so
> that we don't have to retransmit the old data over and over again.

Yeah, I'm still not convinced that rushing such things through, especially
at RFC stage, will make overall development go quicker. Keeping communication
(solely) locked in isolated off-list discussion isn't helping for sure though.
Here both of those caused IMO rather more overhead.

> Some time later Dominik also approached me and suggested the approach that is implemented in
> this patch series, which is:
>   - instead of simply caching the remotes API response, we try to replicate
>     the task archive locally
>   - memorize the time when we last got the latest tasks and only request
>     what is missing since then
>   - limit the replicated task archive's size, dropping the oldest tasks
>     when the size is exceeded
> 
> I didn't have any objections so I went ahead and implemented it.
> 
> The main benefit is that we have to transmit much less data.

Yes, but the original idea would act the same in this regard if it did not
hedge against things that cannot really happen without manually messing around
(at which point all bets are off); or what am I missing here? To be fair: I
only skimmed the patches and asked Wolfgang a bit about it as he looked through
them more closely, so I might indeed miss something or just be slightly confused
about the presentation here.



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-02-05 15:34   ` Thomas Lamprecht
@ 2025-02-06 10:13     ` Lukas Wagner
  2025-02-12  9:19       ` Thomas Lamprecht
  0 siblings, 1 reply; 26+ messages in thread
From: Lukas Wagner @ 2025-02-06 10:13 UTC (permalink / raw)
  To: Proxmox Datacenter Manager development discussion, Thomas Lamprecht

On  2025-02-05 16:34, Thomas Lamprecht wrote:
> On 31.01.25 at 10:35, Lukas Wagner wrote:
>> Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:
>>
>> The status quo for the task cache is to fetch a certain time-based range of tasks
>> (iirc the last seven days) from every remote and cache this for a certain period
>> of time (max-age). If the cached data is too old, we discard the task data
>> and fetch the same time range again.
>> My initial reasoning behind designing it like this was to keep the
>> 'ground truth' completely on the remote side, so *if* somebody were to mess with
> 
> Yeah, I agree with Wolfgang here, nothing in our API allows messing with that,
> neither does a first-class PVE CLI command or the like. If we want to hedge
> against that we basically cannot cache anything at all, rendering them rather
> useless, as even RRD metrics could be altered.
> 
> Using the start-time of the newest available task from the cache as an (inclusive)
> boundary to request newer tasks that happened since the last query would be
> enough, no?
> 
> Pruning older tasks from the cache once some max-time or max-size limit
> is overstepped would then make sense though.

I mean yes, this is pretty much what I implemented in this patch series.
We keep track of the most recent timestamp of a task already in the cache
and only request what is missing in future refreshes. At the moment, we limit
the cache by its size, discarding the oldest entries when the max-size is exceeded.
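
For illustration, that policy boils down to roughly the following (a
simplified sketch, not the verbatim merge_tasks from the patch; it assumes
TaskListItem carries `starttime: i64` and `upid: String`):

    fn merge_tasks(
        mut existing: Vec<TaskListItem>,
        new: Vec<TaskListItem>,
        max: usize,
    ) -> Vec<TaskListItem> {
        existing.extend(new);
        // newest first; ties broken by UPID so duplicates end up adjacent
        existing.sort_by(|a, b| b.starttime.cmp(&a.starttime).then_with(|| a.upid.cmp(&b.upid)));
        // drop duplicates caused by the deliberate fetch overlap
        existing.dedup_by(|a, b| a.upid == b.upid);
        // FIFO replacement: cut off the oldest entries beyond the cap
        existing.truncate(max);
        existing
    }

The dedup step is what makes the deliberate overlap when fetching (see the
OVERLAP_S hunk earlier in the thread) harmless.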

> IMO, for purging older task log entries to avoid a huge cache we should mainly
> focus on an age-of-entries limit, e.g. most of our dashboards will focus on showing
> the task results since X hours, where a sensible minimum for X is probably at
> least 24 hours, though better more like 3 to 5 days, to be able to (quickly)
> see what happened over a weekend, maybe with some holiday attached to it.

Sure, if you want, I can change it to use a time-based approach with a generous maximum size.
Should be a pretty simple change to incorporate.

> 
> A generously high size limit as an additional upper bound to avoid running out of
> space might still be nice to hedge against some node, user or api-tooling going
> crazy and producing a huge amount of tasks.
> 
>> the task archive, we would be consistent after a refresh. Also this allowed us to
>> keep the caching logic on PDM side much simpler, since we are not doing much more
>> than caching the API response from the remote - the same way we already do it
>> for resources, subscription status, etc.
>>
>> The downside is we had some unnecessary traffic, since we keep on fetching old tasks
>> that we already received.
>>
>> I originally posted this as an RFC right before the holidays to get an initial approach
>> out and gather some feedback on it (I wasn't too sure myself if the original approach
>> was a good idea) and also to somewhat unblock UI development for the remote task view.
>>
>> The RFC patches were applied relatively quickly; in a short conversation with Dietmar
>> he mentioned that it might be better to just fetch the most recent missing tasks so
>> that we don't have to retransmit the old data over and over again.
> 
> Yeah, I'm still not convinced that rushing such things through, especially
> at RFC stage, will make overall development go quicker. Keeping communication
> (solely) locked in isolated off-list discussion isn't helping for sure though.
> Here both of those caused IMO rather more overhead.
> 
>> Some time later Dominik also approached me and suggested the approach that is implemented in
>> this patch series, which is:
>>   - instead of simply caching the remotes API response, we try to replicate
>>     the task archive locally
>>   - memorize the time when we last got the latest tasks and only request
>>     what is missing since then
>>   - limit the replicated task archive's size, dropping the oldest tasks
>>     when the size is exceeded
>>
>> I didn't have any objections so I went ahead and implemented it.
>>
>> The main benefit is that we have to transmit much less data.
> 
> Yes, but the original idea would act the same in this regard if it did not
> hedge against things that cannot really happen without manually messing around
> (at which point all bets are off); or what am I missing here? To be fair: I
> only skimmed the patches and asked Wolfgang a bit about it as he looked through
> them more closely, so I might indeed miss something or just be slightly confused
> about the presentation here.

The initial approach, which was merged too soon, had some flaws which were kindly pointed out to me off-list
with some ideas for improvement.
Essentially, the key point of *this* series is to transform the original approach in a way which
does not 'hedge against things that cannot really happen'.

I'm sorry about any confusion here. I'll try to provide more context from the start if something similar
happens again. Also, I'll try to communicate my questions/doubts (like I said, I wasn't sure if the initial
approach from the RFC was good or not, hence the RFC in the first place) more openly if I post an RFC in a
similar way again, e.g. in a cover letter, to make sure that a proper, on-list discussion takes place before
anything is merged.

Thanks for your input on this.

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
  2025-02-06 10:13     ` Lukas Wagner
@ 2025-02-12  9:19       ` Thomas Lamprecht
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Lamprecht @ 2025-02-12  9:19 UTC (permalink / raw)
  To: Lukas Wagner, Proxmox Datacenter Manager development discussion

On 06.02.25 at 11:13, Lukas Wagner wrote:
> On  2025-02-05 16:34, Thomas Lamprecht wrote:
>> On 31.01.25 at 10:35, Lukas Wagner wrote:
>>> Some additional context to explain the 'why', since @Thomas and @Wolfgang requested it:
>>>
>>> The status quo for the task cache is to fetch a certain time-based range of tasks
>>> (iirc the last seven days) from every remote and cache this for a certain period
>>> of time (max-age). If the cached data is too old, we discard the task data
>>> and fetch the same time range again.
>>> My initial reasoning behind designing it like this was to keep the
>>> 'ground truth' completely on the remote side, so *if* somebody were to mess with
>>
>> Yeah, I agree with Wolfgang here, nothing in our API allows messing with that,
>> neither does a first-class PVE CLI command or the like. If we want to hedge
>> against that we basically cannot cache anything at all, rendering them rather
>> useless, as even RRD metrics could be altered.
>>
>> Using the start-time of the newest available task from the cache as an (inclusive)
>> boundary to request newer tasks that happened since the last query would be
>> enough, no?
>>
>> Pruning older tasks from the cache once some max-time or max-size limit
>> is overstepped would then make sense though.
> 
> I mean yes, this is pretty much what I implemented in this patch series.

Ack, then I probably was shoved a bit too much towards a misinterpretation too
early on; that could certainly have been avoided on my side too, but with
lots of change sets "in the air", reading subjects to get the basic big picture
becomes inevitable, at least for me. E.g., to not just "complain" but actually
provide one possible example that would have helped me here:

task cache: leverage existing cache for reducing load on updates using a FIFO max-size strategy

But no biggie, I could have checked more closely myself and was a bit put off
in general by something that you aren't really to blame for (see at the end).

>> IMO, for purging older task log entries to avoid a huge cache we should mainly
>> focus on an age-of-entries limit, e.g. most of our dashboards will focus on showing
>> the task results since X hours, where a sensible minimum for X is probably at
>> least 24 hours, though better more like 3 to 5 days, to be able to (quickly)
>> see what happened over a weekend, maybe with some holiday attached to it.
> 
> Sure, if you want, I can change it to use a time-based approach with a generous maximum size.
> Should be a pretty simple change to incorporate.

One compromise might be using time for the minimum and size for the maximum.
E.g., keep at least something like 48h independent of size, and after that
keep entries until the total size is bigger than a few tens of MiB; but that's
just to put out an idea.

Most reduced task entries should be below 256 bytes (more like 128, but let's
keep some headroom), and only higher if there's a high failure rate, as then
the state entries will be bigger. With a 10 MiB limit per remote and considering
an average of 256 bytes per task, we would still be able to hold all tasks for
a whole week if the remote (cluster) produced ~4 task logs per minute
(10 * 2^20 / 256 / 24 / 60 / 7 = 4.06) – so that might be enough for starters.
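
Sketched as a retention check with the numbers from above (a hypothetical
helper; it assumes tasks are walked newest-first while their sizes are
summed up):

    /// Keep everything younger than 48h; beyond that, keep entries only
    /// while the running total stays under the 10 MiB cap.
    fn keep_task(age_secs: i64, bytes_so_far: u64) -> bool {
        const MIN_AGE_SECS: i64 = 48 * 60 * 60;
        const MAX_BYTES: u64 = 10 * 1024 * 1024;
        age_secs <= MIN_AGE_SECS || bytes_so_far < MAX_BYTES
    }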

W.r.t. splitting this up or not, I can see advantages for both; an alternative
to having a specific file per remote might be to keep all in one but rotate it
per day. That might then be useful for pruning older entries from the cache,
e.g. delete files as long as the total size is above the maximum size but
at least two days (for a 48h minimum) are still left over (rough sketch below).
This way one could still relatively easily process either global or per-remote
task state. Passing a time limit to the backend is probably the most useful
option in practice, as it can be clearly communicated to the user and helps
when looking into problems, where it's normally known roughly when something
happened, but not at which absolute count of tasks it did.

I hope I made some sense here.

>> Yes, but the original idea would act the same in this regard if it did not
>> hedge against things that cannot really happen without manually messing around
>> (at which point all bets are off); or what am I missing here? To be fair: I
>> only skimmed the patches and asked Wolfgang a bit about it as he looked through
>> them more closely, so I might indeed miss something or just be slightly confused
>> about the presentation here.
> 
> The initial approach, which was merged too soon, had some flaws which were kindly pointed out to me off-list
> with some ideas for improvement.
> Essentially, the key point of *this* series is to transform the original approach in a way which
> does not 'hedge against things that cannot really happen'.

Maybe I should have used s/cannot really happen/we do not care for that to
happen, as if it happens someone messed around manually/ for clarity,
but thanks for explicitly stating that with this series we effectively go
that route, whatever it's called.

> 
> I'm sorry about any confusion here. I'll try to provide more context from the start if something similar
> happens again. Also, I'll try to communicate my questions/doubts (like I said, I wasn't sure if the initial
> approach from the RFC was good or not, hence the RFC in the first place) more openly if I post an RFC in a
> similar way again, e.g. in a cover letter, to make sure that a proper, on-list discussion takes place before
> anything is merged.

Thank you for your explanations here, just to summarize/clarify:
I was mostly put off a bit by how it was merged without any hint about
follow-up potential on the list (and that pattern was already repeated a
few times); you are certainly not to blame for that, and while another
presentation of this series might have helped, that part really wasn't the
problem for me. And while it might be a bit much to ask for (so I'm mentioning
it mainly for the sake of completeness): I'd appreciate a summary of such
off-list discussions, independent of which side (off-list reviewer or author)
it comes from – I do not want to discourage off-list talks, they are indeed
often very useful and more efficient to have, especially as long as the
travel time between offices is in the single-digit minute range; and if
everyone involved took part, the summary (mail) of the discussion might not
be needed, but probably still wouldn't hurt. I hope I got enough nuance in
there now.



Thread overview: 26+ messages
2025-01-28 12:25 [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/15] pdm-api-types: derive Debug and PartialEq for TaskListItem Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/15] test support: add NamedTempFile helper Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/15] task cache: add basic test for TaskCache Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/15] task cache: remove max-age machanism Lukas Wagner
2025-01-29 18:27   ` Thomas Lamprecht
2025-01-30  8:01     ` Lukas Wagner
2025-01-30 16:06       ` Thomas Lamprecht
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/15] task cache: add FIFO cache replacement policy Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/15] remote tasks: move to dir based module Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/15] task cache: move to its own submodule Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/15] task cache: fetch every 5mins, requesting only missing tasks Lukas Wagner
2025-01-31 13:42   ` Wolfgang Bumiller
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/15] remote tasks: return tasks in stable order Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/15] remote tasks: allow to force-fetch latest tasks Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/15] fake remote: add missing fields to make the debug feature compile again Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/15] fake remote: generate fake task data Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/15] task cache: tests: improve test coverage Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/15] remote tasks: fix unused variable warning Lukas Wagner
2025-01-28 12:25 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/15] remote-tasks: restrict function visibility Lukas Wagner
2025-01-31  9:35 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO Lukas Wagner
2025-01-31 13:36   ` Wolfgang Bumiller
2025-01-31 13:51     ` Wolfgang Bumiller
2025-02-05 15:34   ` Thomas Lamprecht
2025-02-06 10:13     ` Lukas Wagner
2025-02-12  9:19       ` Thomas Lamprecht
