public inbox for pdm-devel@lists.proxmox.com
* [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI)
@ 2025-02-11 12:05 Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/25] test support: add NamedTempFile helper Lukas Wagner
                   ` (25 more replies)
  0 siblings, 26 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

Key points:
- fetch metrics concurrently
- configuration for metric collection
  - new config /etc/proxmox-datacenter-manager/metric-collection.cfg
      - max-concurrency (number of allowed parallel connections)
      - collection-interval
      - randomized offset for collection start
         (min-interval-offset..max-interval-offset)
      - randomized per-connection delay
         (min-connection-delay..max-connection-delay)
- Add some tests for the core logic in the metric collection system
- Allow triggering metric collection via the API
- Record metric collection statistics in the RRD
  - overall collection time for all remotes
  - per remote response time when fetching metrics
- Persist metric collection state to disk:
  /var/lib/proxmox-datacenter-manager/metric-collection-state.json
  (timestamps of last collection, errors)
- Trigger metric collection for any new remotes added via the API
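
The persisted per-remote state enables checks such as "skip collection if
the remote was collected too recently" (see patch 09). A rough sketch of
such a check, assuming a per-remote last-collection timestamp as described
above; the helper name is illustrative, only the MIN_COLLECTION_INTERVAL
value (10 seconds) is taken from the series:

```rust
/// Minimum metric collection interval in seconds (value from the series).
const MIN_COLLECTION_INTERVAL: i64 = 10;

/// Hypothetical helper: decide whether a remote is due for collection,
/// based on the persisted timestamp of its last successful run.
fn should_collect(now: i64, last_collection: Option<i64>) -> bool {
    match last_collection {
        // collect only if enough time has passed since the last run
        Some(last) => now - last >= MIN_COLLECTION_INTERVAL,
        // no recorded state: the remote was never collected, so it is due
        None => true,
    }
}
```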

- Add new API endpoints
	POST     /metric-collection/trigger with optional 'remote' param
	GET      /metric-collection/status
	GET/PUT  /config/metric-collection/default
	GET      /remotes/<remote>/metric-collection-rrddata
	GET      /metric-collection/rrddata

- Add CLI tooling
	proxmox-datacenter-client metric-collection show-settings
	proxmox-datacenter-client metric-collection update-settings
	proxmox-datacenter-client metric-collection trigger [--remote <remote>]
	proxmox-datacenter-client metric-collection status
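
The max-concurrency setting above bounds how many remotes are fetched in
parallel. A minimal, thread-based sketch of that bound; the real
implementation is async, and `fetch_all` plus the fake fetch closure are
illustrative stand-ins only:

```rust
use std::thread;

/// Sketch of the 'max-concurrency' idea: fetch remotes in batches of at
/// most `max_concurrency` parallel workers. The spawned closure stands in
/// for the actual metric fetch over the network.
fn fetch_all(remotes: &[&str], max_concurrency: usize) -> Vec<String> {
    let mut results = Vec::new();
    for batch in remotes.chunks(max_concurrency) {
        // spawn at most `max_concurrency` workers for this batch
        let handles: Vec<_> = batch
            .iter()
            .map(|r| {
                let name = r.to_string();
                thread::spawn(move || format!("metrics for {name}"))
            })
            .collect();
        // wait for the whole batch before starting the next one
        for h in handles {
            results.push(h.join().expect("worker panicked"));
        }
    }
    results
}
```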


## To reviewers / open questions:
- Please review the defaults I've chosen for the settings, especially
  the default metric collection interval (10 minutes) and
  max-concurrency (10).
  I'd also kindly ask you to double-check the naming of the properties.
  See "pdm-api-types: add CollectionSettings type" for details.

- Please review path and params for new API endpoints (anything public
  facing that is hard to change later)

- I've chosen a section-config style config now, even though we only
  have a single section. This was done for future-proofing reasons:
  maybe we want to add support for different setting 'groups',
  e.g. to have different settings for distinct sets of remotes.
  Does this make sense?
  Or should I just stick to a simple config for now? (At moments like
  these I wish for TOML configs, where we could be a bit more flexible...)

	collection-settings: default
	    max-concurrency 10
	    collection-interval 180
	    min-interval-offset 0
	    max-interval-offset 20
	    min-connection-delay 10
	    max-connection-delay 100


- Should `GET /remotes/<remote>/metric-collection-rrddata` just be
  `rrddata`?
  I'm not sure whether we will add any other PDM-native per-remote
  metrics, and whether we would want to return them from the same API
  call as this...

## Random offset/delay examples
Example with 'max-concurrency' = 3 and 6 remotes.

    X ... timer triggered
    [ A ] .... fetching remote 'A'
    **** .... interval-offset     (usually a couple of seconds)
    #### .... random worker delay (usually in millisecond range)

                         /--########[  B    ] ### [  C  ]--\
                        /---####[  A  ] ###### [ D ]--------\
----X ************* ---/ ---###### [  E  ] #########[  F  ]--\----
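
The schedule sketched above boils down to: align the next run to the
collection interval, then shift it by a random offset within
min-interval-offset..max-interval-offset. A rough sketch of that
computation — a time-seeded value stands in for a proper RNG, and the
function and parameter names are illustrative, not from the series:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Sketch of the interval-alignment plus random-offset scheme: the next
/// run lands on the next interval boundary, shifted by a pseudo-random
/// offset in `min_offset..=max_offset` seconds.
fn next_collection_time(now: u64, interval: u64, min_offset: u64, max_offset: u64) -> u64 {
    let aligned = now - (now % interval) + interval; // next interval boundary
    let span = max_offset - min_offset + 1;
    // time-seeded stand-in for a real RNG
    let jitter = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| u64::from(d.subsec_nanos()))
        .unwrap_or(0)
        % span;
    aligned + min_offset + jitter
}
```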

proxmox-datacenter-manager:

Lukas Wagner (25):
  test support: add NamedTempFile helper
  test support: add NamedTempDir helper
  pdm-api-types: add CollectionSettings type
  pdm-config: add functions for reading/writing metric collection
    settings
  metric collection: split top_entities into separate module
  metric collection: save metric data to RRD in separate task
  metric collection: rework metric poll task
  metric collection: persist state after metric collection
  metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
  metric collection: collect overdue metrics on startup/timer change
  metric collection: add tests for the fetch_remotes function
  metric collection: add test for fetch_overdue
  metric collection: pass rrd cache instance as function parameter
  metric collection: add test for rrd task
  metric collection: wrap rrd_cache::Cache in a struct
  metric collection: record remote response time in metric database
  metric collection: save time needed for collection run to RRD
  metric collection: periodically clean removed remotes from statefile
  api: add endpoint for updating metric collection settings
  api: add endpoint to trigger metric collection
  api: remotes: trigger immediate metric collection for newly added
    nodes
  api: add api for querying metric collection RRD data
  api: metric-collection: add status endpoint
  pdm-client: add metric collection API methods
  cli: add commands for metric-collection settings, trigger, status

 Cargo.toml                                    |   1 +
 cli/client/Cargo.toml                         |   1 +
 cli/client/src/main.rs                        |   2 +
 cli/client/src/metric_collection.rs           | 164 ++++
 lib/pdm-api-types/src/lib.rs                  |   3 +
 lib/pdm-api-types/src/metric_collection.rs    | 195 +++++
 lib/pdm-api-types/src/rrddata.rs              |  26 +
 lib/pdm-client/src/lib.rs                     |  87 ++
 lib/pdm-config/src/lib.rs                     |   1 +
 lib/pdm-config/src/metric_collection.rs       |  69 ++
 server/Cargo.toml                             |   1 +
 server/src/api/config/metric_collection.rs    | 166 ++++
 server/src/api/config/mod.rs                  |   2 +
 server/src/api/metric_collection.rs           |  99 +++
 server/src/api/mod.rs                         |   2 +
 server/src/api/remotes.rs                     |  59 ++
 server/src/api/resources.rs                   |   3 +-
 server/src/api/rrd_common.rs                  |  11 +-
 server/src/bin/proxmox-datacenter-api.rs      |   2 +-
 server/src/lib.rs                             |   2 +-
 .../src/metric_collection/collection_task.rs  | 756 ++++++++++++++++++
 server/src/metric_collection/mod.rs           | 333 ++------
 server/src/metric_collection/rrd_cache.rs     | 204 ++---
 server/src/metric_collection/rrd_task.rs      | 286 +++++++
 server/src/metric_collection/state.rs         | 152 ++++
 server/src/metric_collection/top_entities.rs  | 150 ++++
 server/src/test_support/mod.rs                |   4 +
 server/src/test_support/temp.rs               |  60 ++
 28 files changed, 2479 insertions(+), 362 deletions(-)
 create mode 100644 cli/client/src/metric_collection.rs
 create mode 100644 lib/pdm-api-types/src/metric_collection.rs
 create mode 100644 lib/pdm-config/src/metric_collection.rs
 create mode 100644 server/src/api/config/metric_collection.rs
 create mode 100644 server/src/api/metric_collection.rs
 create mode 100644 server/src/metric_collection/collection_task.rs
 create mode 100644 server/src/metric_collection/rrd_task.rs
 create mode 100644 server/src/metric_collection/state.rs
 create mode 100644 server/src/metric_collection/top_entities.rs
 create mode 100644 server/src/test_support/temp.rs


Summary over all repositories:
  28 files changed, 2479 insertions(+), 362 deletions(-)

-- 
Generated by git-murpp 0.8.0


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 01/25] test support: add NamedTempFile helper
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/25] test support: add NamedTempDir helper Lukas Wagner
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This one is useful when writing tests: it automatically removes the
temporary file when dropped. The name was chosen because of the similar
NamedTempFile struct in the popular tempfile crate.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---

Notes:
    Cherry-picked from
    https://lore.proxmox.com/pdm-devel/20250128122520.167796-3-l.wagner@proxmox.com/T/#u

 server/src/lib.rs               |  2 +-
 server/src/test_support/mod.rs  |  4 ++++
 server/src/test_support/temp.rs | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100644 server/src/test_support/temp.rs

diff --git a/server/src/lib.rs b/server/src/lib.rs
index 12dc912..143ee32 100644
--- a/server/src/lib.rs
+++ b/server/src/lib.rs
@@ -13,7 +13,7 @@ pub mod task_utils;
 pub mod connection;
 pub mod pbs_client;
 
-#[cfg(remote_config = "faked")]
+#[cfg(any(remote_config = "faked", test))]
 pub mod test_support;
 
 use anyhow::Error;
diff --git a/server/src/test_support/mod.rs b/server/src/test_support/mod.rs
index e54cd72..f026011 100644
--- a/server/src/test_support/mod.rs
+++ b/server/src/test_support/mod.rs
@@ -1 +1,5 @@
+#[cfg(remote_config = "faked")]
 pub mod fake_remote;
+
+#[cfg(test)]
+pub mod temp;
diff --git a/server/src/test_support/temp.rs b/server/src/test_support/temp.rs
new file mode 100644
index 0000000..a3a6d59
--- /dev/null
+++ b/server/src/test_support/temp.rs
@@ -0,0 +1,33 @@
+use std::path::{Path, PathBuf};
+
+use anyhow::Error;
+
+use proxmox_sys::fs::CreateOptions;
+
+/// Temporary file that is cleaned up when dropped.
+pub struct NamedTempFile {
+    path: PathBuf,
+}
+
+impl NamedTempFile {
+    /// Create a new temporary file.
+    ///
+    /// The file will be created with the passed [`CreateOptions`].
+    pub fn new(options: CreateOptions) -> Result<Self, Error> {
+        let base = std::env::temp_dir().join("test");
+        let (_, path) = proxmox_sys::fs::make_tmp_file(base, options)?;
+
+        Ok(Self { path })
+    }
+
+    /// Return the [`Path`] to the temporary file.
+    pub fn path(&self) -> &Path {
+        &self.path
+    }
+}
+
+impl Drop for NamedTempFile {
+    fn drop(&mut self) {
+        let _ = std::fs::remove_file(&self.path);
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 02/25] test support: add NamedTempDir helper
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/25] test support: add NamedTempFile helper Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type Lukas Wagner
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This one is useful when writing tests: it automatically removes the
temporary directory when dropped.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/test_support/temp.rs | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/server/src/test_support/temp.rs b/server/src/test_support/temp.rs
index a3a6d59..a93c914 100644
--- a/server/src/test_support/temp.rs
+++ b/server/src/test_support/temp.rs
@@ -31,3 +31,30 @@ impl Drop for NamedTempFile {
         let _ = std::fs::remove_file(&self.path);
     }
 }
+
+/// Temporary directory that is cleaned up when dropped.
+pub struct NamedTempDir {
+    path: PathBuf,
+}
+
+impl NamedTempDir {
+    /// Create a new temporary directory.
+    ///
+    /// The directory will be created with `0o700` permissions.
+    pub fn new() -> Result<Self, Error> {
+        let path = proxmox_sys::fs::make_tmp_dir("/tmp", None)?;
+
+        Ok(Self { path })
+    }
+
+    /// Return the [`Path`] to the temporary directory.
+    pub fn path(&self) -> &Path {
+        &self.path
+    }
+}
+
+impl Drop for NamedTempDir {
+    fn drop(&mut self) {
+        let _ = std::fs::remove_dir_all(&self.path);
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/25] test support: add NamedTempFile helper Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/25] test support: add NamedTempDir helper Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 14:18   ` Maximiliano Sandoval
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/25] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
                   ` (22 subsequent siblings)
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This commit adds the CollectionSettings type, which holds settings for
the metric collection system: the collection interval, the maximum
number of concurrent connections, and the upper/lower bounds for the
random interval offset and per-connection delay.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-api-types/src/lib.rs               |   3 +
 lib/pdm-api-types/src/metric_collection.rs | 180 +++++++++++++++++++++
 2 files changed, 183 insertions(+)
 create mode 100644 lib/pdm-api-types/src/metric_collection.rs

diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
index 3844907..6115e41 100644
--- a/lib/pdm-api-types/src/lib.rs
+++ b/lib/pdm-api-types/src/lib.rs
@@ -19,6 +19,9 @@ pub use acl::*;
 mod node_config;
 pub use node_config::*;
 
+mod metric_collection;
+pub use metric_collection::*;
+
 mod proxy;
 pub use proxy::HTTP_PROXY_SCHEMA;
 
diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
new file mode 100644
index 0000000..99dac9b
--- /dev/null
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -0,0 +1,180 @@
+//! API types for metric collection settings.
+
+use serde::{Deserialize, Serialize};
+
+use proxmox_schema::{api, Updater};
+
+/// Default metric collection interval.
+pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+
+/// Minimum metric collection interval.
+pub const MIN_COLLECTION_INTERVAL: u64 = 10;
+
+/// Maximum metric collection interval.
+/// PVE and PBS keep 30 minutes of metric history,
+/// maximum is set to 25 minutes to leave some headroom.
+pub const MAX_COLLECTION_INTERVAL: u64 = 1500;
+
+/// Default number of concurrent connections.
+pub const DEFAULT_CONCURRENCY: usize = 10;
+/// Maximum number of concurrent connections.
+pub const MAX_CONCURRENCY: u64 = 50;
+/// Minimum number of concurrent connections (no concurrency).
+pub const MIN_CONCURRENCY: u64 = 1;
+
+/// Default upper bound for the random delay for the
+/// main metric collection loop.
+pub const DEFAULT_MAX_OFFSET: u64 = 10;
+/// Default lower bound for the random delay for the
+/// main metric collection loop.
+pub const DEFAULT_MIN_OFFSET: u64 = 0;
+
+/// Highest configurable upper bound for the random interval offset for the main loop.
+pub const MAX_INTERVAL_OFFSET: u64 = 300;
+/// Lowest configurable lower bound for the random interval offset for the main loop.
+pub const MIN_INTERVAL_OFFSET: u64 = 0;
+
+/// Default upper bound for the random individual connection delay.
+pub const DEFAULT_MAX_CONNECTION_DELAY: u64 = 100;
+/// Default lower bound for the random individual connection delay.
+pub const DEFAULT_MIN_CONNECTION_DELAY: u64 = 0;
+
+/// Highest configurable upper bound for the random individual connection delay.
+pub const MAX_CONNECTION_DELAY: u64 = 1000;
+/// Lowest configurable lower bound for the random individual connection delay.
+pub const MIN_CONNECTION_DELAY: u64 = 0;
+
+#[api(
+    properties: {
+        "collection-interval" : {
+            optional: true,
+            default: DEFAULT_COLLECTION_INTERVAL as isize,
+            minimum: MIN_COLLECTION_INTERVAL as isize,
+            maximum: MAX_COLLECTION_INTERVAL as isize,
+        },
+        "max-concurrent-connections" : {
+            optional: true,
+            default: DEFAULT_CONCURRENCY as isize,
+            minimum: MIN_CONCURRENCY as isize,
+            maximum: MAX_CONCURRENCY as isize,
+        },
+        "max-interval-offset" : {
+            optional: true,
+            default: DEFAULT_MAX_OFFSET as isize,
+            minimum: MIN_INTERVAL_OFFSET as isize,
+            maximum: MAX_INTERVAL_OFFSET as isize,
+        },
+        "min-interval-offset" : {
+            optional: true,
+            default: DEFAULT_MIN_OFFSET as isize,
+            minimum: MIN_INTERVAL_OFFSET as isize,
+            maximum: MAX_INTERVAL_OFFSET as isize,
+        },
+        "max-connection-delay" : {
+            optional: true,
+            default: DEFAULT_MAX_CONNECTION_DELAY as isize,
+            minimum: MIN_CONNECTION_DELAY as isize,
+            maximum: MAX_CONNECTION_DELAY as isize,
+        },
+        "min-connection-delay" : {
+            optional: true,
+            default: DEFAULT_MIN_CONNECTION_DELAY as isize,
+            minimum: MIN_CONNECTION_DELAY as isize,
+            maximum: MAX_CONNECTION_DELAY as isize,
+        },
+    },
+)]
+#[derive(Clone, Default, Deserialize, Serialize, Updater)]
+#[serde(rename_all = "kebab-case")]
+/// Settings for the metric collection system.
+pub struct CollectionSettings {
+    /// Collection settings ID
+    #[updater(skip)]
+    pub id: String,
+
+    /// Interval in seconds at which to collect metrics.
+    /// The point in time at which metrics are collected
+    /// are aligned based on the collection interval. For instance,
+    /// a collection interval of 300 (5 minutes) would schedule metric collection
+    /// at 11:00:00, 11:05:00 (without accounting for the random offset).
+    ///
+    /// To avoid load spikes, metric collection runs are offset by
+    /// a random number of seconds between min_interval_offset and max_interval_offset.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub collection_interval: Option<u64>,
+
+    /// Maximum number of concurrent connections while collecting metrics.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub max_concurrent_connections: Option<usize>,
+
+    /// Maximum offset in seconds for the metric collection interval.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub max_interval_offset: Option<u64>,
+
+    /// Minimum offset in seconds for the metric collection interval.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub min_interval_offset: Option<u64>,
+
+    /// Maximum random delay in milliseconds for each outgoing connection.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub max_connection_delay: Option<u64>,
+
+    /// Minimum random delay in milliseconds for each outgoing connection.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub min_connection_delay: Option<u64>,
+}
+
+#[api]
+#[derive(Copy, Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Deletable property for [`CollectionSettings`].
+pub enum DeletableCollectionSettingsProperty {
+    /// Delete 'collection-interval'.
+    CollectionInterval,
+    /// Delete 'max-concurrent-connections'.
+    MaxConcurrentConnections,
+    /// Delete 'max-interval-offset'.
+    MaxIntervalOffset,
+    /// Delete 'min-interval-offset'.
+    MinIntervalOffset,
+    /// Delete 'max-connection-delay'.
+    MaxConnectionDelay,
+    /// Delete 'min-connection-delay'.
+    MinConnectionDelay,
+}
+
+impl CollectionSettings {
+    /// Return the collection interval or the default if not configured.
+    pub fn collection_interval_or_default(&self) -> u64 {
+        self.collection_interval
+            .unwrap_or(DEFAULT_COLLECTION_INTERVAL)
+    }
+
+    /// Return the number of allowed concurrent connections or the default if not configured.
+    pub fn max_concurrent_connections_or_default(&self) -> usize {
+        self.max_concurrent_connections
+            .unwrap_or(DEFAULT_CONCURRENCY)
+    }
+
+    /// Return the upper bound for the main loop delay or the default if not configured.
+    pub fn max_interval_offset_or_default(&self) -> u64 {
+        self.max_interval_offset.unwrap_or(DEFAULT_MAX_OFFSET)
+    }
+
+    /// Return the lower bound for the main loop delay or the default if not configured.
+    pub fn min_interval_offset_or_default(&self) -> u64 {
+        self.min_interval_offset.unwrap_or(DEFAULT_MIN_OFFSET)
+    }
+
+    /// Return the upper bound for the connection delay or the default if not configured.
+    pub fn max_connection_delay_or_default(&self) -> u64 {
+        self.max_connection_delay
+            .unwrap_or(DEFAULT_MAX_CONNECTION_DELAY)
+    }
+
+    /// Return the lower bound for the connection delay or the default if not configured.
+    pub fn min_connection_delay_or_default(&self) -> u64 {
+        self.min_connection_delay
+            .unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 04/25] pdm-config: add functions for reading/writing metric collection settings
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (2 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/25] metric collection: split top_entities into separate module Lukas Wagner
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This commit adds support for reading and writing a new configuration
file at /etc/proxmox-datacenter-manager/metric-collection.cfg.

It is a regular SectionConfig file holding CollectionSettings entries
for now. For starters, there will be only a single entry with the key
'default', e.g.

collection-settings: default
    max-concurrency 10
    ...

It might seem a bit odd to use a section config file for something like
this, but this gives us the ability to expand it more easily in the
future, e.g. different settings for different groups of remotes, other
section types for logical structuring, etc.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-config/src/lib.rs               |  1 +
 lib/pdm-config/src/metric_collection.rs | 69 +++++++++++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 lib/pdm-config/src/metric_collection.rs

diff --git a/lib/pdm-config/src/lib.rs b/lib/pdm-config/src/lib.rs
index ac398ca..d80ee40 100644
--- a/lib/pdm-config/src/lib.rs
+++ b/lib/pdm-config/src/lib.rs
@@ -5,6 +5,7 @@ pub use pdm_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
 
 pub mod certificate_config;
 pub mod domains;
+pub mod metric_collection;
 pub mod node;
 pub mod remotes;
 pub mod setup;
diff --git a/lib/pdm-config/src/metric_collection.rs b/lib/pdm-config/src/metric_collection.rs
new file mode 100644
index 0000000..613bb0e
--- /dev/null
+++ b/lib/pdm-config/src/metric_collection.rs
@@ -0,0 +1,69 @@
+//! Read/write metric collection configuration
+
+use std::sync::OnceLock;
+
+use anyhow::Error;
+
+use proxmox_config_digest::ConfigDigest;
+use proxmox_product_config::{open_api_lockfile, replace_config, ApiLockGuard};
+
+use pdm_api_types::{CollectionSettings, PROXMOX_SAFE_ID_FORMAT};
+
+use pdm_buildcfg::configdir;
+use proxmox_schema::{ApiType, ObjectSchema, Schema, StringSchema};
+use proxmox_section_config::{SectionConfig, SectionConfigData, SectionConfigPlugin};
+
+const METRIC_COLLECTION_CFG_FILENAME: &str = configdir!("/metric-collection.cfg");
+const METRIC_COLLECTION_CFG_LOCKFILE: &str = configdir!("/.metric-collection.lock");
+
+/// The section-config type name for metric collection settings.
+pub const COLLECTION_SETTINGS_TYPE: &str = "collection-settings";
+
+/// Section config schema for the public config file.
+fn config_parser() -> &'static SectionConfig {
+    static CONFIG: OnceLock<SectionConfig> = OnceLock::new();
+    CONFIG.get_or_init(config_init)
+}
+
+const ENTITY_NAME_SCHEMA: Schema = StringSchema::new("collection-settings key.")
+    .format(&PROXMOX_SAFE_ID_FORMAT)
+    .min_length(2)
+    .max_length(32)
+    .schema();
+
+fn config_init() -> SectionConfig {
+    let mut config = SectionConfig::new(&ENTITY_NAME_SCHEMA);
+
+    const MATCHER_SCHEMA: &ObjectSchema = CollectionSettings::API_SCHEMA.unwrap_object_schema();
+    config.register_plugin(SectionConfigPlugin::new(
+        COLLECTION_SETTINGS_TYPE.into(),
+        Some(String::from("id")),
+        MATCHER_SCHEMA,
+    ));
+
+    config
+}
+
+/// Lock the metric-collection config
+pub fn lock_config() -> Result<ApiLockGuard, Error> {
+    open_api_lockfile(METRIC_COLLECTION_CFG_LOCKFILE, None, true)
+}
+
+/// Return contents of the metric-collection config
+pub fn config() -> Result<(SectionConfigData, ConfigDigest), Error> {
+    let content = proxmox_sys::fs::file_read_optional_string(METRIC_COLLECTION_CFG_FILENAME)?
+        .unwrap_or_default();
+    let digest = openssl::sha::sha256(content.as_bytes());
+    let data = config_parser().parse(METRIC_COLLECTION_CFG_FILENAME, &content)?;
+
+    Ok((data, digest.into()))
+}
+
+/// Replace the currently persisted metric-collection config
+pub fn save_config(config: &SectionConfigData) -> Result<(), Error> {
+    let content = config_parser().write(METRIC_COLLECTION_CFG_FILENAME, config)?;
+    replace_config(METRIC_COLLECTION_CFG_FILENAME, content.as_bytes())?;
+
+    Ok(())
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 05/25] metric collection: split top_entities into separate module
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (3 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/25] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task Lukas Wagner
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This makes the parent module a bit easier to navigate.

No functional changes intended.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/resources.rs                  |   3 +-
 server/src/metric_collection/mod.rs          | 149 +------------------
 server/src/metric_collection/top_entities.rs | 148 ++++++++++++++++++
 3 files changed, 152 insertions(+), 148 deletions(-)
 create mode 100644 server/src/metric_collection/top_entities.rs

diff --git a/server/src/api/resources.rs b/server/src/api/resources.rs
index 453d9e8..afda51e 100644
--- a/server/src/api/resources.rs
+++ b/server/src/api/resources.rs
@@ -26,6 +26,7 @@ use proxmox_subscription::SubscriptionStatus;
 use pve_api_types::{ClusterResource, ClusterResourceType};
 
 use crate::connection;
+use crate::metric_collection::top_entities;
 
 pub const ROUTER: Router = Router::new()
     .get(&list_subdirs_api_method!(SUBDIRS))
@@ -320,7 +321,7 @@ async fn get_top_entities(timeframe: Option<RrdTimeframe>) -> Result<TopEntities
     let (remotes_config, _) = pdm_config::remotes::config()?;
 
     let timeframe = timeframe.unwrap_or(RrdTimeframe::Day);
-    let res = crate::metric_collection::calculate_top(&remotes_config.sections, timeframe, 10);
+    let res = top_entities::calculate_top(&remotes_config.sections, timeframe, 10);
     Ok(res)
 }
 
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index b37d678..5540f93 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -6,15 +6,13 @@ use anyhow::Error;
 use pbs_api_types::{MetricDataPoint, MetricDataType};
 use proxmox_rrd::rrd::DataSourceType;
 
-use pdm_api_types::{
-    remotes::RemoteType,
-    resource::{Resource, ResourceRrdData, TopEntities, TopEntity},
-};
+use pdm_api_types::remotes::RemoteType;
 use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
 
 use crate::{connection, task_utils};
 
 pub mod rrd_cache;
+pub mod top_entities;
 
 const COLLECTION_INTERVAL: u64 = 60;
 
@@ -150,146 +148,3 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
         data_source_type,
     );
 }
-
-fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
-    let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
-        Ok(idx) | Err(idx) => idx,
-    };
-
-    vec.insert(index, value);
-    if vec.len() > limit {
-        for _ in 0..(vec.len() - limit) {
-            vec.remove(0);
-        }
-    }
-}
-
-// for now simple sum of the values => area under the graph curve
-fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
-    let mut coefficient = 0.0;
-    for point in values.data.iter() {
-        let value = point.unwrap_or_default();
-        if value.is_finite() {
-            coefficient += value;
-        }
-    }
-
-    coefficient
-}
-
-// FIXME: cache the values instead of calculate freshly every time?
-// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
-pub fn calculate_top(
-    remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
-    timeframe: proxmox_rrd_api_types::RrdTimeframe,
-    num: usize,
-) -> TopEntities {
-    let mut guest_cpu = Vec::new();
-    let mut node_cpu = Vec::new();
-    let mut node_memory = Vec::new();
-
-    for remote_name in remotes.keys() {
-        if let Some(data) =
-            crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
-        {
-            for res in data.resources {
-                let id = res.id().to_string();
-                let name = format!("pve/{remote_name}/{id}");
-                match &res {
-                    Resource::PveStorage(_) => {}
-                    Resource::PveQemu(_) | Resource::PveLxc(_) => {
-                        if let Some(entity) =
-                            get_entity(timeframe, remote_name, res, name, "cpu_current")
-                        {
-                            let coefficient = (entity.0 * 100.0).round() as usize;
-                            insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
-                        }
-                    }
-                    Resource::PveNode(_) => {
-                        if let Some(entity) = get_entity(
-                            timeframe,
-                            remote_name,
-                            res.clone(),
-                            name.clone(),
-                            "cpu_current",
-                        ) {
-                            let coefficient = (entity.0 * 100.0).round() as usize;
-                            insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
-                        }
-                        // convert mem/mem_total into a single entity
-                        if let Some(mut mem) = get_entity(
-                            timeframe,
-                            remote_name,
-                            res.clone(),
-                            name.clone(),
-                            "mem_used",
-                        ) {
-                            if let Some(mem_total) =
-                                get_entity(timeframe, remote_name, res, name, "mem_total")
-                            {
-                                // skip if we don't have the same amount of data for used and total
-                                let mem_rrd = &mem.1.rrd_data.data;
-                                let mem_total_rrd = &mem_total.1.rrd_data.data;
-                                if mem_rrd.len() != mem_total_rrd.len() {
-                                    continue;
-                                }
-                                let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
-                                let mut mem_usage = Vec::new();
-                                for i in 0..mem_rrd.len() {
-                                    let point = match (mem_rrd[i], mem_total_rrd[i]) {
-                                        (Some(mem), Some(total)) => Some(mem / total),
-                                        _ => None,
-                                    };
-                                    mem_usage.push(point)
-                                }
-                                mem.1.rrd_data.data = mem_usage;
-                                insert_sorted(&mut node_memory, (coefficient, mem.1), num);
-                            }
-                        }
-                    }
-                    Resource::PbsNode(_) => {}
-                    Resource::PbsDatastore(_) => {}
-                }
-            }
-        }
-    }
-
-    TopEntities {
-        guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
-        node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
-        node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
-    }
-}
-
-fn get_entity(
-    timeframe: proxmox_rrd_api_types::RrdTimeframe,
-    remote_name: &String,
-    res: Resource,
-    name: String,
-    metric: &str,
-) -> Option<(f64, TopEntity)> {
-    if let Ok(Some(values)) = rrd_cache::extract_data(
-        &name,
-        metric,
-        timeframe,
-        proxmox_rrd_api_types::RrdMode::Average,
-    ) {
-        let coefficient = calculate_coefficient(&values);
-        if coefficient > 0.0 {
-            return Some((
-                coefficient,
-                TopEntity {
-                    remote: remote_name.to_string(),
-                    resource: res,
-                    rrd_data: ResourceRrdData {
-                        start: values.start,
-                        resolution: values.resolution,
-                        data: values.data,
-                    },
-                },
-            ));
-        }
-    }
-
-    None
-}
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
new file mode 100644
index 0000000..f8e053f
--- /dev/null
+++ b/server/src/metric_collection/top_entities.rs
@@ -0,0 +1,148 @@
+use std::collections::HashMap;
+
+use pdm_api_types::resource::{Resource, ResourceRrdData, TopEntities, TopEntity};
+
+use super::rrd_cache;
+
+fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
+    let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
+        Ok(idx) | Err(idx) => idx,
+    };
+
+    vec.insert(index, value);
+    if vec.len() > limit {
+        for _ in 0..(vec.len() - limit) {
+            vec.remove(0);
+        }
+    }
+}
+
+// for now simple sum of the values => area under the graph curve
+fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
+    let mut coefficient = 0.0;
+    for point in values.data.iter() {
+        let value = point.unwrap_or_default();
+        if value.is_finite() {
+            coefficient += value;
+        }
+    }
+
+    coefficient
+}
+
+// FIXME: cache the values instead of calculate freshly every time?
+// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
+pub fn calculate_top(
+    remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
+    timeframe: proxmox_rrd_api_types::RrdTimeframe,
+    num: usize,
+) -> TopEntities {
+    let mut guest_cpu = Vec::new();
+    let mut node_cpu = Vec::new();
+    let mut node_memory = Vec::new();
+
+    for remote_name in remotes.keys() {
+        if let Some(data) =
+            crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
+        {
+            for res in data.resources {
+                let id = res.id().to_string();
+                let name = format!("pve/{remote_name}/{id}");
+                match &res {
+                    Resource::PveStorage(_) => {}
+                    Resource::PveQemu(_) | Resource::PveLxc(_) => {
+                        if let Some(entity) =
+                            get_entity(timeframe, remote_name, res, name, "cpu_current")
+                        {
+                            let coefficient = (entity.0 * 100.0).round() as usize;
+                            insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
+                        }
+                    }
+                    Resource::PveNode(_) => {
+                        if let Some(entity) = get_entity(
+                            timeframe,
+                            remote_name,
+                            res.clone(),
+                            name.clone(),
+                            "cpu_current",
+                        ) {
+                            let coefficient = (entity.0 * 100.0).round() as usize;
+                            insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
+                        }
+                        // convert mem/mem_total into a single entity
+                        if let Some(mut mem) = get_entity(
+                            timeframe,
+                            remote_name,
+                            res.clone(),
+                            name.clone(),
+                            "mem_used",
+                        ) {
+                            if let Some(mem_total) =
+                                get_entity(timeframe, remote_name, res, name, "mem_total")
+                            {
+                                // skip if we don't have the same amount of data for used and total
+                                let mem_rrd = &mem.1.rrd_data.data;
+                                let mem_total_rrd = &mem_total.1.rrd_data.data;
+                                if mem_rrd.len() != mem_total_rrd.len() {
+                                    continue;
+                                }
+                                let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
+                                let mut mem_usage = Vec::new();
+                                for i in 0..mem_rrd.len() {
+                                    let point = match (mem_rrd[i], mem_total_rrd[i]) {
+                                        (Some(mem), Some(total)) => Some(mem / total),
+                                        _ => None,
+                                    };
+                                    mem_usage.push(point)
+                                }
+                                mem.1.rrd_data.data = mem_usage;
+                                insert_sorted(&mut node_memory, (coefficient, mem.1), num);
+                            }
+                        }
+                    }
+                    Resource::PbsNode(_) => {}
+                    Resource::PbsDatastore(_) => {}
+                }
+            }
+        }
+    }
+
+    TopEntities {
+        guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
+        node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
+        node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
+    }
+}
+
+fn get_entity(
+    timeframe: proxmox_rrd_api_types::RrdTimeframe,
+    remote_name: &String,
+    res: Resource,
+    name: String,
+    metric: &str,
+) -> Option<(f64, TopEntity)> {
+    if let Ok(Some(values)) = rrd_cache::extract_data(
+        &name,
+        metric,
+        timeframe,
+        proxmox_rrd_api_types::RrdMode::Average,
+    ) {
+        let coefficient = calculate_coefficient(&values);
+        if coefficient > 0.0 {
+            return Some((
+                coefficient,
+                TopEntity {
+                    remote: remote_name.to_string(),
+                    resource: res,
+                    rrd_data: ResourceRrdData {
+                        start: values.start,
+                        resolution: values.resolution,
+                        data: values.data,
+                    },
+                },
+            ));
+        }
+    }
+
+    None
+}
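For reference, the `insert_sorted` helper above can be exercised in isolation. This sketch reproduces its logic (with the eviction loop simplified to a `while`, which behaves the same) together with a small usage example:

```rust
// Self-contained reproduction of the `insert_sorted` helper above:
// keeps the vector sorted ascending by coefficient and bounded to
// `limit` entries, evicting the smallest entries from the front.
fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
    let index = match vec.binary_search_by_key(&value.0, |(key, _)| *key) {
        Ok(idx) | Err(idx) => idx,
    };

    vec.insert(index, value);
    // Equivalent to the patch's `for _ in 0..(vec.len() - limit)` loop.
    while vec.len() > limit {
        vec.remove(0);
    }
}

fn main() {
    let mut top: Vec<(usize, &str)> = Vec::new();
    for entry in [(50, "a"), (10, "b"), (70, "c"), (30, "d")] {
        insert_sorted(&mut top, entry, 3);
    }
    // Only the three largest coefficients survive, smallest first.
    assert_eq!(top, vec![(30, "d"), (50, "a"), (70, "c")]);
}
```

Keeping the list ascending means the final `into_iter().map(...)` collects in `calculate_top` yield the kept top-N entities ordered from lowest to highest coefficient.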
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (4 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/25] metric collection: split top_entities split into separate module Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-12 13:59   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task Lukas Wagner
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This is a preparation for concurrent metric collection. While the RRD
cache appears to be safe for concurrent access (it uses an RwLock to
protect the data), delegating all RRD writes to a separate tokio task
makes the code IMO easier to reason about and understand. Furthermore,
it decouples the 'metric fetching' and 'metric storing' parts, making it
much easier to write tests for both parts independently.

For better code separation, this also splits out the new task into
a new submodule.
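The fetch/store split can be illustrated with a minimal sketch using std
threads and `std::sync::mpsc` in place of tokio (the patch itself uses
`tokio::sync::mpsc` plus `spawn_blocking`); `StoreRequest` here is a
simplified placeholder for the patch's `RrdStoreRequest`:

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-in for the patch's `RrdStoreRequest`.
struct StoreRequest {
    remote: String,
    value: f64,
}

// Spawn a dedicated writer thread, feed it all requests over a channel,
// and return what it stored. In the patch, the writer is a tokio task
// (`store_in_rrd_task`) and "storing" means blocking RRD file IO
// wrapped in `spawn_blocking`.
fn run_pipeline(requests: Vec<StoreRequest>) -> Vec<(String, f64)> {
    let (tx, rx) = mpsc::channel::<StoreRequest>();

    let writer = thread::spawn(move || {
        let mut stored = Vec::new();
        while let Ok(req) = rx.recv() {
            stored.push((req.remote, req.value));
        }
        stored // channel closed -> all senders gone -> clean shutdown
    });

    for req in requests {
        // The fetching side only hands off a message; it never blocks on IO.
        tx.send(req).unwrap();
    }
    drop(tx); // close the channel so the writer terminates

    writer.join().unwrap()
}

fn main() {
    let stored = run_pipeline(vec![StoreRequest {
        remote: "pve-1".into(),
        value: 0.5,
    }]);
    println!("{stored:?}");
}
```

The same shutdown property carries over to the async version: once every
sender is dropped, `recv()` returns `None`/`Err` and the store task exits
on its own.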

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/metric_collection/mod.rs      | 102 +++++++----------------
 server/src/metric_collection/rrd_task.rs |  92 ++++++++++++++++++++
 2 files changed, 124 insertions(+), 70 deletions(-)
 create mode 100644 server/src/metric_collection/rrd_task.rs

diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 5540f93..06ade5f 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -2,18 +2,18 @@ use std::collections::HashMap;
 use std::pin::pin;
 
 use anyhow::Error;
-
-use pbs_api_types::{MetricDataPoint, MetricDataType};
-use proxmox_rrd::rrd::DataSourceType;
+use tokio::sync::mpsc::{self, Sender};
 
 use pdm_api_types::remotes::RemoteType;
-use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
 
 use crate::{connection, task_utils};
 
 pub mod rrd_cache;
+mod rrd_task;
 pub mod top_entities;
 
+use rrd_task::RrdStoreRequest;
+
 const COLLECTION_INTERVAL: u64 = 60;
 
 /// Initialize the RRD cache
@@ -25,14 +25,22 @@ pub fn init() -> Result<(), Error> {
 
 /// Start the metric collection task.
 pub fn start_task() {
+    let (tx, rx) = mpsc::channel(128);
+
+    tokio::spawn(async move {
+        let task_scheduler = pin!(metric_collection_task(tx));
+        let abort_future = pin!(proxmox_daemon::shutdown_future());
+        futures::future::select(task_scheduler, abort_future).await;
+    });
+
     tokio::spawn(async move {
-        let task_scheduler = pin!(metric_collection_task());
+        let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
         let abort_future = pin!(proxmox_daemon::shutdown_future());
         futures::future::select(task_scheduler, abort_future).await;
     });
 }
 
-async fn metric_collection_task() -> Result<(), Error> {
+async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
     let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
 
     loop {
@@ -59,37 +67,33 @@ async fn metric_collection_task() -> Result<(), Error> {
                             .cluster_metrics_export(Some(true), Some(false), Some(start_time))
                             .await?;
 
-                        //// Involves some blocking file IO
-                        tokio::task::spawn_blocking(move || {
-                            let mut most_recent_timestamp = 0;
+                        let most_recent =
+                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
 
-                            for data_point in metrics.data {
-                                most_recent_timestamp =
-                                    most_recent_timestamp.max(data_point.timestamp);
-                                store_metric_pve(&remote_name_clone, &data_point);
-                            }
+                        sender
+                            .send(RrdStoreRequest::Pve {
+                                remote: remote_name_clone,
+                                metrics,
+                            })
+                            .await?;
 
-                            most_recent_timestamp
-                        })
-                        .await
+                        Ok::<i64, Error>(most_recent)
                     }
                     RemoteType::Pbs => {
                         let client = connection::make_pbs_client(remote)?;
                         let metrics = client.metrics(Some(true), Some(start_time)).await?;
 
-                        // Involves some blocking file IO
-                        tokio::task::spawn_blocking(move || {
-                            let mut most_recent_timestamp = 0;
+                        let most_recent =
+                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
 
-                            for data_point in metrics.data {
-                                most_recent_timestamp =
-                                    most_recent_timestamp.max(data_point.timestamp);
-                                store_metric_pbs(&remote_name_clone, &data_point);
-                            }
+                        sender
+                            .send(RrdStoreRequest::Pbs {
+                                remote: remote_name_clone,
+                                metrics,
+                            })
+                            .await?;
 
-                            most_recent_timestamp
-                        })
-                        .await
+                        Ok::<i64, Error>(most_recent)
                     }
                 }?;
 
@@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
         }
     }
 }
-
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
-    let name = format!(
-        "pve/{remote_name}/{id}/{metric}",
-        id = data_point.id,
-        metric = data_point.metric,
-    );
-
-    let data_source_type = match data_point.ty {
-        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
-        ClusterMetricsDataType::Counter => DataSourceType::Counter,
-        ClusterMetricsDataType::Derive => DataSourceType::Derive,
-    };
-
-    rrd_cache::update_value(
-        &name,
-        data_point.value,
-        data_point.timestamp,
-        data_source_type,
-    );
-}
-
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
-    let name = format!(
-        "pbs/{remote_name}/{id}/{metric}",
-        id = data_point.id,
-        metric = data_point.metric,
-    );
-
-    let data_source_type = match data_point.ty {
-        MetricDataType::Gauge => DataSourceType::Gauge,
-        MetricDataType::Counter => DataSourceType::Counter,
-        MetricDataType::Derive => DataSourceType::Derive,
-    };
-
-    rrd_cache::update_value(
-        &name,
-        data_point.value,
-        data_point.timestamp,
-        data_source_type,
-    );
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
new file mode 100644
index 0000000..fe19580
--- /dev/null
+++ b/server/src/metric_collection/rrd_task.rs
@@ -0,0 +1,92 @@
+use anyhow::Error;
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use proxmox_rrd::rrd::DataSourceType;
+use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
+use tokio::sync::mpsc::Receiver;
+
+use super::rrd_cache;
+
+/// Store request for the RRD task.
+pub(super) enum RrdStoreRequest {
+    /// Store PVE metrics.
+    Pve {
+        /// Remote name
+        remote: String,
+        /// Metric data
+        metrics: ClusterMetrics,
+    },
+    /// Store PBS metrics.
+    Pbs {
+        /// Remote name
+        remote: String,
+        /// Metric data
+        metrics: Metrics,
+    },
+}
+
+/// Task which stores received metrics in the RRD. Metric data is fed into
+/// this task via an MPSC channel.
+pub(super) async fn store_in_rrd_task(
+    mut receiver: Receiver<RrdStoreRequest>,
+) -> Result<(), Error> {
+    while let Some(msg) = receiver.recv().await {
+        // Involves some blocking file IO
+        tokio::task::spawn_blocking(move || match msg {
+            RrdStoreRequest::Pve { remote, metrics } => {
+                for data_point in metrics.data {
+                    store_metric_pve(&remote, &data_point);
+                }
+            }
+            RrdStoreRequest::Pbs { remote, metrics } => {
+                for data_point in metrics.data {
+                    store_metric_pbs(&remote, &data_point);
+                }
+            }
+        })
+        .await?;
+    }
+
+    Ok(())
+}
+
+fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+    let name = format!(
+        "pve/{remote_name}/{id}/{metric}",
+        id = data_point.id,
+        metric = data_point.metric,
+    );
+
+    let data_source_type = match data_point.ty {
+        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
+        ClusterMetricsDataType::Counter => DataSourceType::Counter,
+        ClusterMetricsDataType::Derive => DataSourceType::Derive,
+    };
+
+    rrd_cache::update_value(
+        &name,
+        data_point.value,
+        data_point.timestamp,
+        data_source_type,
+    );
+}
+
+fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+    let name = format!(
+        "pbs/{remote_name}/{id}/{metric}",
+        id = data_point.id,
+        metric = data_point.metric,
+    );
+
+    let data_source_type = match data_point.ty {
+        MetricDataType::Gauge => DataSourceType::Gauge,
+        MetricDataType::Counter => DataSourceType::Counter,
+        MetricDataType::Derive => DataSourceType::Derive,
+    };
+
+    rrd_cache::update_value(
+        &name,
+        data_point.value,
+        data_point.timestamp,
+        data_source_type,
+    );
+}
-- 
2.39.5






* [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (5 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:58   ` Lukas Wagner
  2025-02-12 15:57   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/25] metric collection: persist state after metric collection Lukas Wagner
                   ` (18 subsequent siblings)
  25 siblings, 2 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

Metric collection is mostly limited by network latency, not disk IO or
CPU. This means we should be able to get significant speedups by polling
remotes concurrently, using spawned tokio workers.

To avoid load/IO spikes, we limit the number of concurrent connections
using a semaphore.

Each concurrent task which fetches a single remote waits a random amount
of time before actually connecting to the remote. The upper and lower
bounds for this random delay can be configured in the metric collection
settings. The main aim of this mechanism is to reduce load spikes.

Furthermore, each time the main collection interval timer fires, a
random delay is introduced before actually starting the collection
process. This is useful due to how we set up the timer; we configure it
to fire at aligned points in time. For instance, if the collection
interval is set to 60s, the timer fires at minute boundaries. Using this
random offset, we can avoid triggering at the same time as other timers
(cron, systemd).

Furthermore, we add a mechanism to trigger metric collection manually.
This is achieved by using `tokio::select!` on both the timer's `tick`
method and the `recv` method of an `mpsc` channel which is used to send
control messages to the metric collection task.

For better code structure, the collection task is split into a separate
module.
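Both jitter helpers in this patch sanitize their configured bounds the
same way before drawing a uniform random delay via `rand`'s
`gen_range(min..=max)`. The capping rule in isolation (a sketch; the
real helpers additionally log a warning when the bounds are inverted):

```rust
/// Cap `min` to `max` when the configured bounds are inverted, so the
/// subsequent `rng.gen_range(min..=max)` always gets a valid range.
fn capped_bounds(min: u64, max: u64) -> (u64, u64) {
    if min > max {
        // Degenerate range: the random delay becomes exactly `max`.
        (max, max)
    } else {
        (min, max)
    }
}

fn main() {
    assert_eq!(capped_bounds(100, 500), (100, 500));
    assert_eq!(capped_bounds(500, 100), (100, 100));
}
```

Without the cap, an inverted range would make `gen_range` panic at
runtime, so misconfiguration degrades to a fixed delay instead of
crashing the collection task.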

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 Cargo.toml                                    |   1 +
 server/Cargo.toml                             |   1 +
 server/src/bin/proxmox-datacenter-api.rs      |   2 +-
 .../src/metric_collection/collection_task.rs  | 317 ++++++++++++++++++
 server/src/metric_collection/mod.rs           | 128 +++----
 5 files changed, 366 insertions(+), 83 deletions(-)
 create mode 100644 server/src/metric_collection/collection_task.rs

diff --git a/Cargo.toml b/Cargo.toml
index 4f3b1d0..7dd60a5 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -108,6 +108,7 @@ once_cell = "1.3.1"
 openssl = "0.10.40"
 percent-encoding = "2.1"
 pin-project-lite = "0.2"
+rand = "0.8"
 regex = "1.5.5"
 serde = { version = "1.0", features = ["derive"] }
 serde_cbor = "0.11.1"
diff --git a/server/Cargo.toml b/server/Cargo.toml
index 7b0058e..c8308f2 100644
--- a/server/Cargo.toml
+++ b/server/Cargo.toml
@@ -24,6 +24,7 @@ nix.workspace = true
 once_cell.workspace = true
 openssl.workspace = true
 percent-encoding.workspace = true
+rand.workspace = true
 serde.workspace = true
 serde_json.workspace = true
 syslog.workspace = true
diff --git a/server/src/bin/proxmox-datacenter-api.rs b/server/src/bin/proxmox-datacenter-api.rs
index a79094d..6e85e52 100644
--- a/server/src/bin/proxmox-datacenter-api.rs
+++ b/server/src/bin/proxmox-datacenter-api.rs
@@ -286,7 +286,7 @@ async fn run(debug: bool) -> Result<(), Error> {
     });
 
     start_task_scheduler();
-    metric_collection::start_task();
+    metric_collection::start_task()?;
     resource_cache::start_task();
 
     server.await?;
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
new file mode 100644
index 0000000..8420039
--- /dev/null
+++ b/server/src/metric_collection/collection_task.rs
@@ -0,0 +1,317 @@
+use std::{collections::HashMap, sync::Arc, time::Duration};
+
+use anyhow::Error;
+use rand::Rng;
+use tokio::{
+    sync::{
+        mpsc::{Receiver, Sender},
+        OwnedSemaphorePermit, Semaphore,
+    },
+    time::Interval,
+};
+
+use proxmox_section_config::typed::SectionConfigData;
+
+use pdm_api_types::{
+    remotes::{Remote, RemoteType},
+    CollectionSettings,
+};
+use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
+
+use crate::{connection, task_utils};
+
+use super::rrd_task::RrdStoreRequest;
+
+/// Control messages for the metric collection task.
+pub(super) enum ControlMsg {
+    CollectSingleRemote(String),
+    CollectAllRemotes,
+}
+
+/// Task which periodically collects metrics from all remotes and stores
+/// them in the local metrics database.
+pub(super) struct MetricCollectionTask {
+    most_recent_timestamps: HashMap<String, i64>,
+    settings: CollectionSettings,
+    metric_data_tx: Sender<RrdStoreRequest>,
+    control_message_rx: Receiver<ControlMsg>,
+}
+
+impl MetricCollectionTask {
+    /// Create a new metric collection task.
+    pub(super) fn new(
+        metric_data_tx: Sender<RrdStoreRequest>,
+        control_message_rx: Receiver<ControlMsg>,
+    ) -> Result<Self, Error> {
+        let settings = Self::get_settings_or_default();
+
+        Ok(Self {
+            most_recent_timestamps: HashMap::new(),
+            settings,
+            metric_data_tx,
+            control_message_rx,
+        })
+    }
+
+    /// Run the metric collection task.
+    ///
+    /// This function never returns.
+    #[tracing::instrument(skip_all, name = "metric_collection_task")]
+    pub(super) async fn run(&mut self) {
+        let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
+
+        log::debug!(
+            "metric collection starting up. collection interval set to {} seconds",
+            self.settings.collection_interval_or_default()
+        );
+
+        loop {
+            let old_settings = self.settings.clone();
+            tokio::select! {
+                _ = timer.tick() => {
+                    // Reload settings in case they have changed in the meanwhile
+                    self.settings = Self::get_settings_or_default();
+
+                    log::debug!("starting metric collection from all remotes - triggered by timer");
+                    self.sleep_for_random_interval_offset().await;
+
+                    if let Some(remotes) = Self::load_remote_config() {
+                        let to_fetch = remotes.order.as_slice();
+                        self.fetch_remotes(&remotes, to_fetch).await;
+                    }
+                }
+
+                val = self.control_message_rx.recv() => {
+                    // Reload settings in case they have changed in the meanwhile
+                    self.settings = Self::get_settings_or_default();
+                    match val {
+                        Some(ControlMsg::CollectSingleRemote(remote)) => {
+                            if let Some(remotes) = Self::load_remote_config() {
+                                log::debug!("starting metric collection for remote '{remote}' - triggered by control message");
+                                self.fetch_remotes(&remotes, &[remote]).await;
+                            }
+                        }
+                        Some(ControlMsg::CollectAllRemotes) => {
+                            if let Some(remotes) = Self::load_remote_config() {
+                                log::debug!("starting metric collection from all remotes - triggered by control message");
+                                self.fetch_remotes(&remotes, &remotes.order).await;
+                            }
+                        }
+                        _ => {},
+                    }
+                }
+            }
+
+            let interval = self.settings.collection_interval_or_default();
+
+            if old_settings.collection_interval_or_default() != interval {
+                log::info!(
+                    "metric collection interval changed to {} seconds, reloading timer",
+                    interval
+                );
+                timer = Self::setup_timer(interval);
+            }
+        }
+    }
+
+    async fn sleep_for_random_interval_offset(&self) {
+        let mut min = self.settings.min_interval_offset_or_default();
+        let max = self.settings.max_interval_offset_or_default();
+
+        if min > max {
+            log::warn!(
+                "min-interval-offset is larger than max-interval-offset ({min} > {max}) - \
+                capping it to max-interval-offset ({max})"
+            );
+            min = max;
+        }
+
+        let jitter = {
+            let mut rng = rand::thread_rng();
+            rng.gen_range(min..=max)
+        };
+        tokio::time::sleep(Duration::from_secs(jitter)).await;
+    }
+
+    async fn sleep_for_random_connection_delay(settings: &CollectionSettings) {
+        let mut min = settings.min_connection_delay_or_default();
+        let max = settings.max_connection_delay_or_default();
+
+        if min > max {
+            log::warn!(
+                "min-collection-delay is larger than max-collection-delay ({min} > {max}) - \
+                capping it to max-collection-delay ({max})"
+            );
+            min = max;
+        }
+
+        let jitter = {
+            let mut rng = rand::thread_rng();
+            rng.gen_range(min..=max)
+        };
+
+        tokio::time::sleep(Duration::from_millis(jitter)).await;
+    }
+
+    fn get_settings_or_default() -> CollectionSettings {
+        // This function is a bit odd, but not sure if there is a much nicer
+        // way to do it. We want to fall back to defaults if
+        //   - the config file does not exist (no errors logged)
+        //   - if section type is wrong or the config failed to parse (log errors)
+
+        fn get_settings_impl() -> Result<CollectionSettings, Error> {
+            let (config, _) = pdm_config::metric_collection::config()?;
+
+            let all_sections: Vec<CollectionSettings> =
+                config.convert_to_typed_array(COLLECTION_SETTINGS_TYPE)?;
+
+            for section in all_sections {
+                if section.id == "default" {
+                    return Ok(section);
+                }
+            }
+
+            Ok(CollectionSettings {
+                id: "default".into(),
+                ..Default::default()
+            })
+        }
+
+        match get_settings_impl() {
+            Ok(settings) => settings,
+            Err(err) => {
+                log::error!("could not read metric collection settings: {err} - falling back to default config");
+
+                CollectionSettings {
+                    id: "default".into(),
+                    ..Default::default()
+                }
+            }
+        }
+    }
+
+    /// Set up a [`tokio::time::Interval`] instance with the provided interval.
+    /// The timer will be aligned, e.g. an interval of `60` will let the timer
+    /// fire at minute boundaries.
+    fn setup_timer(interval: u64) -> Interval {
+        let mut timer = tokio::time::interval(Duration::from_secs(interval));
+        let first_run = task_utils::next_aligned_instant(interval).into();
+        timer.reset_at(first_run);
+
+        timer
+    }
+
+    /// Convenience helper to load `remote.cfg`, logging the error
+    /// and returning `None` if the config could not be read.
+    fn load_remote_config() -> Option<SectionConfigData<Remote>> {
+        match pdm_config::remotes::config() {
+            Ok((remotes, _)) => Some(remotes),
+            Err(e) => {
+                log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
+                None
+            }
+        }
+    }
+
+    /// Fetch metric data from a provided list of remotes concurrently.
+    /// The maximum number of concurrent connections is determined by
+    /// `max_concurrent_connections` in the [`CollectionSettings`]
+    /// instance in `self`.
+    async fn fetch_remotes(
+        &mut self,
+        config: &SectionConfigData<Remote>,
+        remotes_to_fetch: &[String],
+    ) {
+        let semaphore = Arc::new(Semaphore::new(
+            self.settings.max_concurrent_connections_or_default(),
+        ));
+        let mut handles = Vec::new();
+
+        for remote_name in remotes_to_fetch {
+            let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+
+            // unwrap is okay here, acquire_* will only fail if `close` has been
+            // called on the semaphore.
+            let permit = semaphore.clone().acquire_owned().await.unwrap();
+
+            if let Some(remote) = config.get(remote_name).cloned() {
+                log::debug!("fetching remote '{}'", remote.id);
+                let handle = tokio::spawn(Self::fetch_single_remote(
+                    self.settings.clone(),
+                    remote,
+                    start_time,
+                    self.metric_data_tx.clone(),
+                    permit,
+                ));
+
+                handles.push((remote_name.clone(), handle));
+            }
+        }
+
+        for (remote_name, handle) in handles {
+            let res = handle.await;
+
+            match res {
+                Ok(Ok(ts)) => {
+                    self.most_recent_timestamps
+                        .insert(remote_name.to_string(), ts);
+                }
+                Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
+                Err(err) => {
+                    log::error!(
+                        "join error for metric collection task for remote {remote_name}: {err}"
+                    )
+                }
+            }
+        }
+    }
+
+    /// Fetch a single remote.
+    #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
+    async fn fetch_single_remote(
+        settings: CollectionSettings,
+        remote: Remote,
+        start_time: i64,
+        sender: Sender<RrdStoreRequest>,
+        _permit: OwnedSemaphorePermit,
+    ) -> Result<i64, Error> {
+        Self::sleep_for_random_connection_delay(&settings).await;
+
+        let most_recent_timestamp = match remote.ty {
+            RemoteType::Pve => {
+                let client = connection::make_pve_client(&remote)?;
+                let metrics = client
+                    .cluster_metrics_export(Some(true), Some(false), Some(start_time))
+                    .await?;
+
+                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+                sender
+                    .send(RrdStoreRequest::Pve {
+                        remote: remote.id.clone(),
+                        metrics,
+                    })
+                    .await?;
+
+                most_recent
+            }
+            RemoteType::Pbs => {
+                let client = connection::make_pbs_client(&remote)?;
+                let metrics = client.metrics(Some(true), Some(start_time)).await?;
+
+                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+                sender
+                    .send(RrdStoreRequest::Pbs {
+                        remote: remote.id.clone(),
+                        metrics,
+                    })
+                    .await?;
+
+                most_recent
+            }
+        };
+
+        Ok(most_recent_timestamp)
+    }
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 06ade5f..0ab6678 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,20 +1,17 @@
-use std::collections::HashMap;
 use std::pin::pin;
+use std::sync::OnceLock;
 
-use anyhow::Error;
+use anyhow::{bail, Error};
 use tokio::sync::mpsc::{self, Sender};
 
-use pdm_api_types::remotes::RemoteType;
-
-use crate::{connection, task_utils};
-
+mod collection_task;
 pub mod rrd_cache;
 mod rrd_task;
 pub mod top_entities;
 
-use rrd_task::RrdStoreRequest;
+use collection_task::{ControlMsg, MetricCollectionTask};
 
-const COLLECTION_INTERVAL: u64 = 60;
+static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
 
 /// Initialize the RRD cache
 pub fn init() -> Result<(), Error> {
@@ -24,89 +21,56 @@ pub fn init() -> Result<(), Error> {
 }
 
 /// Start the metric collection task.
-pub fn start_task() {
-    let (tx, rx) = mpsc::channel(128);
+pub fn start_task() -> Result<(), Error> {
+    let (metric_data_tx, metric_data_rx) = mpsc::channel(128);
+
+    let (trigger_collection_tx, trigger_collection_rx) = mpsc::channel(128);
+    if CONTROL_MESSAGE_TX.set(trigger_collection_tx).is_err() {
+        bail!("control message sender already set");
+    }
 
     tokio::spawn(async move {
-        let task_scheduler = pin!(metric_collection_task(tx));
+        let metric_collection_task_future = pin!(async move {
+            let mut task =
+                MetricCollectionTask::new(metric_data_tx, trigger_collection_rx).unwrap();
+            task.run().await;
+        });
+
         let abort_future = pin!(proxmox_daemon::shutdown_future());
-        futures::future::select(task_scheduler, abort_future).await;
+        futures::future::select(metric_collection_task_future, abort_future).await;
     });
 
     tokio::spawn(async move {
-        let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
+        let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
         let abort_future = pin!(proxmox_daemon::shutdown_future());
-        futures::future::select(task_scheduler, abort_future).await;
+        futures::future::select(rrd_task_future, abort_future).await;
     });
+
+    Ok(())
 }
 
-async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
-    let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
-
-    loop {
-        let delay_target = task_utils::next_aligned_instant(COLLECTION_INTERVAL);
-        tokio::time::sleep_until(tokio::time::Instant::from_std(delay_target)).await;
-
-        let remotes = match pdm_config::remotes::config() {
-            Ok((remotes, _)) => remotes,
-            Err(e) => {
-                log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
-                continue;
-            }
-        };
-
-        for (remote_name, remote) in &remotes.sections {
-            let start_time = *most_recent_timestamps.get(remote_name).unwrap_or(&0);
-            let remote_name_clone = remote_name.clone();
-
-            let res = async {
-                let most_recent_timestamp = match remote.ty {
-                    RemoteType::Pve => {
-                        let client = connection::make_pve_client(remote)?;
-                        let metrics = client
-                            .cluster_metrics_export(Some(true), Some(false), Some(start_time))
-                            .await?;
-
-                        let most_recent =
-                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
-                        sender
-                            .send(RrdStoreRequest::Pve {
-                                remote: remote_name_clone,
-                                metrics,
-                            })
-                            .await?;
-
-                        Ok::<i64, Error>(most_recent)
-                    }
-                    RemoteType::Pbs => {
-                        let client = connection::make_pbs_client(remote)?;
-                        let metrics = client.metrics(Some(true), Some(start_time)).await?;
-
-                        let most_recent =
-                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
-                        sender
-                            .send(RrdStoreRequest::Pbs {
-                                remote: remote_name_clone,
-                                metrics,
-                            })
-                            .await?;
-
-                        Ok::<i64, Error>(most_recent)
-                    }
-                }?;
-
-                Ok::<i64, Error>(most_recent_timestamp)
-            }
-            .await;
-
-            match res {
-                Ok(ts) => {
-                    most_recent_timestamps.insert(remote_name.to_string(), ts);
-                }
-                Err(err) => log::error!("failed to collect metrics for {remote_name}: {err}"),
-            }
-        }
+/// Schedule metric collection for a given remote as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection_for_remote(remote: &str) -> Result<(), Error> {
+    if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+        sender
+            .send(ControlMsg::CollectSingleRemote(remote.into()))
+            .await?;
     }
+
+    Ok(())
+}
+
+/// Schedule metric collection for all remotes as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection() -> Result<(), Error> {
+    if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+        sender.send(ControlMsg::CollectAllRemotes).await?;
+    }
+
+    Ok(())
 }
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 08/25] metric collection: persist state after metric collection
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (6 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/25] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

The metric collection task has to maintain some state; at the moment
this is only the most recent timestamp of all received metric data
points. We use this as a cut-off time when requesting metrics from the
remote, as everything *older* than that should already be stored in the
database.

Up until now, the state was only kept in memory, which means it was
lost when the daemon restarted.

The changes in this commit make the metric collection system
load/save the state from a file in
/var/lib/proxmox-datacenter-manager/metric-collection-state.json

The following data points are stored for every remote:
- most-recent-datapoint:
  Timestamp of the most recent datapoint, used as a cut-off when
  fetching new metrics
- last-collection:
  Timestamp of the last *successful* metric collection for a remote,
  based on PDM's local time
- error:
  String representation of any error that occurred in the last
  collection attempt
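
With the field names above (kebab-case via the serde rename, `error`
omitted when unset), the persisted state file could look roughly like
this sketch — remote names, timestamps, and the error text are purely
illustrative:

```json
{
  "remote-status": {
    "pve-example": {
      "most-recent-datapoint": 1739275440,
      "last-collection": 1739275452
    },
    "pbs-example": {
      "most-recent-datapoint": 1739275380,
      "last-collection": 1739271000,
      "error": "could not fetch metrics from 'pbs-example': connection timed out"
    }
  }
}
```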

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 134 ++++++++++++------
 server/src/metric_collection/mod.rs           |   1 +
 server/src/metric_collection/rrd_task.rs      |  57 ++++++--
 server/src/metric_collection/state.rs         | 128 +++++++++++++++++
 4 files changed, 268 insertions(+), 52 deletions(-)
 create mode 100644 server/src/metric_collection/state.rs

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 8420039..211b6ee 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,16 +1,17 @@
-use std::{collections::HashMap, sync::Arc, time::Duration};
+use std::{sync::Arc, time::Duration};
 
 use anyhow::Error;
 use rand::Rng;
 use tokio::{
     sync::{
         mpsc::{Receiver, Sender},
-        OwnedSemaphorePermit, Semaphore,
+        oneshot, OwnedSemaphorePermit, Semaphore,
     },
     time::Interval,
 };
 
 use proxmox_section_config::typed::SectionConfigData;
+use proxmox_sys::fs::CreateOptions;
 
 use pdm_api_types::{
     remotes::{Remote, RemoteType},
@@ -20,7 +21,16 @@ use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
 
 use crate::{connection, task_utils};
 
-use super::rrd_task::RrdStoreRequest;
+use super::{
+    rrd_task::{RrdStoreRequest, RrdStoreResult},
+    state::{MetricCollectionState, RemoteStatus},
+};
+
+/// Location of the metric collection state file.
+const METRIC_COLLECTION_STATE_FILE: &str = concat!(
+    pdm_buildcfg::PDM_STATE_DIR_M!(),
+    "/metric-collection-state.json"
+);
 
 /// Control messages for the metric collection task.
 pub(super) enum ControlMsg {
@@ -31,7 +41,7 @@ pub(super) enum ControlMsg {
 /// Task which periodically collects metrics from all remotes and stores
 /// them in the local metrics database.
 pub(super) struct MetricCollectionTask {
-    most_recent_timestamps: HashMap<String, i64>,
+    state: MetricCollectionState,
     settings: CollectionSettings,
     metric_data_tx: Sender<RrdStoreRequest>,
     control_message_rx: Receiver<ControlMsg>,
@@ -44,9 +54,10 @@ impl MetricCollectionTask {
         control_message_rx: Receiver<ControlMsg>,
     ) -> Result<Self, Error> {
         let settings = Self::get_settings_or_default();
+        let state = load_state()?;
 
         Ok(Self {
-            most_recent_timestamps: HashMap::new(),
+            state,
             settings,
             metric_data_tx,
             control_message_rx,
@@ -102,6 +113,10 @@ impl MetricCollectionTask {
                 }
             }
 
+            if let Err(e) = self.state.save() {
+                log::error!("could not update metric collection state: {e}");
+            }
+
             let interval = self.settings.collection_interval_or_default();
 
             if old_settings.collection_interval_or_default() != interval {
@@ -228,7 +243,11 @@ impl MetricCollectionTask {
         let mut handles = Vec::new();
 
         for remote_name in remotes_to_fetch {
-            let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+            let status = self
+                .state
+                .get_status(remote_name)
+                .cloned()
+                .unwrap_or_default();
 
             // unwrap is okay here, acquire_* will only fail if `close` has been
             // called on the semaphore.
@@ -239,7 +258,7 @@ impl MetricCollectionTask {
                 let handle = tokio::spawn(Self::fetch_single_remote(
                     self.settings.clone(),
                     remote,
-                    start_time,
+                    status,
                     self.metric_data_tx.clone(),
                     permit,
                 ));
@@ -253,8 +272,7 @@ impl MetricCollectionTask {
 
             match res {
                 Ok(Ok(ts)) => {
-                    self.most_recent_timestamps
-                        .insert(remote_name.to_string(), ts);
+                    self.state.set_status(remote_name, ts);
                 }
                 Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
                 Err(err) => {
@@ -271,47 +289,81 @@ impl MetricCollectionTask {
     async fn fetch_single_remote(
         settings: CollectionSettings,
         remote: Remote,
-        start_time: i64,
+        mut status: RemoteStatus,
         sender: Sender<RrdStoreRequest>,
         _permit: OwnedSemaphorePermit,
-    ) -> Result<i64, Error> {
+    ) -> Result<RemoteStatus, Error> {
         Self::sleep_for_random_connection_delay(&settings).await;
+        let (result_tx, result_rx) = oneshot::channel();
+
+        let now = proxmox_time::epoch_i64();
+
+        let res: Result<RrdStoreResult, Error> = async {
+            match remote.ty {
+                RemoteType::Pve => {
+                    let client = connection::make_pve_client(&remote)?;
+                    let metrics = client
+                        .cluster_metrics_export(
+                            Some(true),
+                            Some(false),
+                            Some(status.most_recent_datapoint),
+                        )
+                        .await?;
+
+                    sender
+                        .send(RrdStoreRequest::Pve {
+                            remote: remote.id.clone(),
+                            metrics,
+                            channel: result_tx,
+                        })
+                        .await?;
+                }
+                RemoteType::Pbs => {
+                    let client = connection::make_pbs_client(&remote)?;
+                    let metrics = client
+                        .metrics(Some(true), Some(status.most_recent_datapoint))
+                        .await?;
+
+                    sender
+                        .send(RrdStoreRequest::Pbs {
+                            remote: remote.id.clone(),
+                            metrics,
+                            channel: result_tx,
+                        })
+                        .await?;
+                }
+            }
 
-        let most_recent_timestamp = match remote.ty {
-            RemoteType::Pve => {
-                let client = connection::make_pve_client(&remote)?;
-                let metrics = client
-                    .cluster_metrics_export(Some(true), Some(false), Some(start_time))
-                    .await?;
-
-                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+            result_rx.await.map_err(Error::from)
+        }
+        .await;
 
-                sender
-                    .send(RrdStoreRequest::Pve {
-                        remote: remote.id.clone(),
-                        metrics,
-                    })
-                    .await?;
+        match res {
+            Ok(result) => {
+                status.most_recent_datapoint = result.most_recent_timestamp;
+                status.last_collection = Some(now);
+                status.error = None;
+            }
 
-                most_recent
+            Err(err) => {
+                status.error = Some(err.to_string());
+                log::error!("could not fetch metrics from '{}': {err}", remote.id);
             }
-            RemoteType::Pbs => {
-                let client = connection::make_pbs_client(&remote)?;
-                let metrics = client.metrics(Some(true), Some(start_time)).await?;
+        }
 
-                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+        Ok(status)
+    }
+}
 
-                sender
-                    .send(RrdStoreRequest::Pbs {
-                        remote: remote.id.clone(),
-                        metrics,
-                    })
-                    .await?;
+/// Load the metric collection state file.
+pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
+    let api_uid = pdm_config::api_user()?.uid;
+    let api_gid = pdm_config::api_group()?.gid;
 
-                most_recent
-            }
-        };
+    let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
 
-        Ok(most_recent_timestamp)
-    }
+    Ok(MetricCollectionState::new(
+        METRIC_COLLECTION_STATE_FILE.into(),
+        file_options,
+    ))
 }
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 0ab6678..cf3350b 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,7 @@ use tokio::sync::mpsc::{self, Sender};
 mod collection_task;
 pub mod rrd_cache;
 mod rrd_task;
+mod state;
 pub mod top_entities;
 
 use collection_task::{ControlMsg, MetricCollectionTask};
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index fe19580..7ee56c5 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,8 +1,10 @@
 use anyhow::Error;
-use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use tokio::sync::{mpsc::Receiver, oneshot};
+
 use proxmox_rrd::rrd::DataSourceType;
+
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
 use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use tokio::sync::mpsc::Receiver;
 
 use super::rrd_cache;
 
@@ -14,6 +16,8 @@ pub(super) enum RrdStoreRequest {
         remote: String,
         /// Metric data
         metrics: ClusterMetrics,
+        /// Oneshot channel to return the [`RrdStoreResult`].
+        channel: oneshot::Sender<RrdStoreResult>,
     },
     /// Store PBS metrics.
     Pbs {
@@ -21,9 +25,17 @@ pub(super) enum RrdStoreRequest {
         remote: String,
         /// Metric data
         metrics: Metrics,
+        /// Oneshot channel to return the [`RrdStoreResult`].
+        channel: oneshot::Sender<RrdStoreResult>,
     },
 }
 
+/// Result for a [`RrdStoreRequest`].
+pub(super) struct RrdStoreResult {
+    /// Most recent timestamp of any stored metric datapoint.
+    pub(super) most_recent_timestamp: i64,
+}
+
 /// Task which stores received metrics in the RRD. Metric data is fed into
 /// this task via a MPSC channel.
 pub(super) async fn store_in_rrd_task(
@@ -31,17 +43,40 @@ pub(super) async fn store_in_rrd_task(
 ) -> Result<(), Error> {
     while let Some(msg) = receiver.recv().await {
         //// Involves some blocking file IO
-        tokio::task::spawn_blocking(move || match msg {
-            RrdStoreRequest::Pve { remote, metrics } => {
-                for data_point in metrics.data {
-                    store_metric_pve(&remote, &data_point);
+        tokio::task::spawn_blocking(move || {
+            let mut most_recent_timestamp = 0;
+            let channel = match msg {
+                RrdStoreRequest::Pve {
+                    remote,
+                    metrics,
+                    channel,
+                } => {
+                    for data_point in metrics.data {
+                        most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+                        store_metric_pve(&remote, &data_point);
+                    }
+
+                    channel
                 }
-            }
-            RrdStoreRequest::Pbs { remote, metrics } => {
-                for data_point in metrics.data {
-                    store_metric_pbs(&remote, &data_point);
+                RrdStoreRequest::Pbs {
+                    remote,
+                    metrics,
+                    channel,
+                } => {
+                    for data_point in metrics.data {
+                        most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+                        store_metric_pbs(&remote, &data_point);
+                    }
+
+                    channel
                 }
-            }
+            };
+
+            if let Err(_err) = channel.send(RrdStoreResult {
+                most_recent_timestamp,
+            }) {
+                log::error!("could not send RrdStoreResult to metric collection task");
+            };
         })
         .await?;
     }
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
new file mode 100644
index 0000000..4e88a45
--- /dev/null
+++ b/server/src/metric_collection/state.rs
@@ -0,0 +1,128 @@
+use std::{
+    collections::HashMap,
+    path::{Path, PathBuf},
+};
+
+use anyhow::Error;
+use serde::{Deserialize, Serialize};
+
+use proxmox_sys::fs::CreateOptions;
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Metric collection state file content.
+struct State {
+    remote_status: HashMap<String, RemoteStatus>,
+}
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// A remote's metric collection state.
+pub struct RemoteStatus {
+    /// Most recent datapoint - time stamp is based on remote time
+    pub most_recent_datapoint: i64,
+    /// Last successful metric collection - timestamp based on PDM's time
+    pub last_collection: Option<i64>,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    /// Any error that occurred during the last metric collection attempt.
+    pub error: Option<String>,
+}
+
+/// Manage and persist metric collection state.
+pub struct MetricCollectionState {
+    /// Path to the persisted state
+    path: PathBuf,
+    /// File owner/perms for the persisted state file
+    file_options: CreateOptions,
+    /// The current state
+    state: State,
+}
+
+impl MetricCollectionState {
+    /// Initialize state by trying to load the existing statefile. If the file does not exist,
+    /// the state will be empty. If the file failed to load, the state will be empty and
+    /// an error will be logged.
+    pub fn new(statefile: PathBuf, file_options: CreateOptions) -> Self {
+        let state = match Self::load_or_default(&statefile) {
+            Ok(state) => state,
+            Err(err) => {
+                log::error!("could not load metric collection state: {err}");
+                Default::default()
+            }
+        };
+
+        Self {
+            path: statefile,
+            file_options,
+            state,
+        }
+    }
+
+    /// Set a remote's status.
+    pub fn set_status(&mut self, remote: String, remote_state: RemoteStatus) {
+        self.state.remote_status.insert(remote, remote_state);
+    }
+
+    /// Get a remote's status.
+    pub fn get_status(&self, remote: &str) -> Option<&RemoteStatus> {
+        self.state.remote_status.get(remote)
+    }
+
+    /// Persist the state to the statefile.
+    pub fn save(&self) -> Result<(), Error> {
+        let data = serde_json::to_vec_pretty(&self.state)?;
+        proxmox_sys::fs::replace_file(&self.path, &data, self.file_options.clone(), true)?;
+
+        Ok(())
+    }
+
+    fn load_or_default(path: &Path) -> Result<State, Error> {
+        let content = proxmox_sys::fs::file_read_optional_string(path)?;
+
+        if let Some(content) = content {
+            Ok(serde_json::from_str(&content)?)
+        } else {
+            Ok(Default::default())
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use proxmox_sys::fs::CreateOptions;
+
+    use super::*;
+
+    use crate::test_support::temp::NamedTempFile;
+
+    fn get_options() -> CreateOptions {
+        CreateOptions::new()
+            .owner(nix::unistd::Uid::effective())
+            .group(nix::unistd::Gid::effective())
+            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+    }
+
+    #[test]
+    fn save_and_load() -> Result<(), Error> {
+        let file = NamedTempFile::new(get_options())?;
+        let options = get_options();
+        let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+        state.set_status(
+            "some-remote".into(),
+            RemoteStatus {
+                most_recent_datapoint: 1234,
+                ..Default::default()
+            },
+        );
+
+        state.save()?;
+
+        let state = MetricCollectionState::new(file.path().into(), options);
+
+        let status = state.get_status("some-remote").unwrap();
+        assert_eq!(status.most_recent_datapoint, 1234);
+
+        Ok(())
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 09/25] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (7 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/25] metric collection: persist state after metric collection Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

It makes sense to limit the collection frequency per remote, especially
since we will have the ability to trigger collection manually.
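
The rate limit amounts to a simple timestamp comparison; the sketch
below uses hypothetical names (`should_skip`, plain `i64` epoch
seconds) and is not the actual implementation:

```rust
/// Sketch of the per-remote rate limit: skip a remote if its last
/// successful collection is more recent than the minimum allowed
/// interval. A remote that was never collected has no timestamp and
/// is treated as last collected at epoch 0, so it is never skipped.
fn should_skip(now: i64, last_collection: Option<i64>, min_interval: i64) -> bool {
    now - last_collection.unwrap_or(0) < min_interval
}

fn main() {
    // Collected 5 seconds ago with a 10 second minimum interval -> skip.
    assert!(should_skip(100, Some(95), 10));
    // Never collected before -> do not skip.
    assert!(!should_skip(100, None, 10));
    println!("ok");
}
```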

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/metric_collection/collection_task.rs | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 211b6ee..b4e3207 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -15,7 +15,7 @@ use proxmox_sys::fs::CreateOptions;
 
 use pdm_api_types::{
     remotes::{Remote, RemoteType},
-    CollectionSettings,
+    CollectionSettings, MIN_COLLECTION_INTERVAL,
 };
 use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
 
@@ -241,6 +241,7 @@ impl MetricCollectionTask {
             self.settings.max_concurrent_connections_or_default(),
         ));
         let mut handles = Vec::new();
+        let now = proxmox_time::epoch_i64();
 
         for remote_name in remotes_to_fetch {
             let status = self
@@ -249,6 +250,13 @@ impl MetricCollectionTask {
                 .cloned()
                 .unwrap_or_default();
 
+            if now - status.last_collection.unwrap_or(0) < MIN_COLLECTION_INTERVAL as i64 {
+                log::debug!(
+                    "skipping metric collection for remote '{remote_name}' - data is recent enough"
+                );
+                continue;
+            }
+
             // unwrap is okay here, acquire_* will only fail if `close` has been
             // called on the semaphore.
             let permit = semaphore.clone().acquire_owned().await.unwrap();
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (8 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/25] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-13  8:55   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/25] metric collection: add tests for the fetch_remotes function Lukas Wagner
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

Because the timer fires at aligned points in time and might not fire
right away after being set up, gaps in the data could occur if the
timer interval changes or at daemon startup.

To mitigate this, on daemon startup and also if the collection interval
changes, we
  - check if the time until the next scheduled regular collection
    plus the time *since* the last successful collection exceeds
    the configured collection interval
  - if yes, we collect immediately
  - if no, we do nothing and let the remote be collected at the
    next timer tick
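
The check above boils down to a single inequality. A minimal, illustrative sketch (the function name and parameters below are placeholders for exposition, not the actual code from this patch):

```rust
/// Illustrative sketch of the overdue check described above: a remote is
/// considered overdue if the time since its last successful collection,
/// plus the time that will still pass until the next regular timer tick,
/// exceeds the configured collection interval.
fn is_overdue(now: i64, last_collection: i64, secs_until_next_tick: i64, interval: i64) -> bool {
    (now - last_collection) + secs_until_next_tick > interval
}

fn main() {
    // Interval 120s, next tick in 30s:
    // last collection 100s ago -> 100 + 30 > 120 -> collect immediately
    assert!(is_overdue(1000, 900, 30, 120));
    // last collection 20s ago -> 20 + 30 <= 120 -> wait for the timer
    assert!(!is_overdue(1000, 980, 30, 120));
}
```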

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 72 +++++++++++++++++--
 1 file changed, 65 insertions(+), 7 deletions(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index b4e3207..f0742ea 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,4 +1,7 @@
-use std::{sync::Arc, time::Duration};
+use std::{
+    sync::Arc,
+    time::{Duration, Instant},
+};
 
 use anyhow::Error;
 use rand::Rng;
@@ -69,13 +72,18 @@ impl MetricCollectionTask {
     /// This function does never return.
     #[tracing::instrument(skip_all, name = "metric_collection_task")]
     pub(super) async fn run(&mut self) {
-        let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
+        let (mut timer, mut next_run) =
+            Self::setup_timer(self.settings.collection_interval_or_default());
 
         log::debug!(
             "metric collection starting up. collection interval set to {} seconds",
             self.settings.collection_interval_or_default()
         );
 
+        // Check and fetch any remote which would be overdue by the time the
+        // timer first fires.
+        self.fetch_overdue_and_save_state(next_run).await;
+
         loop {
             let old_settings = self.settings.clone();
             tokio::select! {
@@ -124,7 +132,12 @@ impl MetricCollectionTask {
                     "metric collection interval changed to {} seconds, reloading timer",
                     interval
                 );
-                timer = Self::setup_timer(interval);
+                (timer, next_run) = Self::setup_timer(interval);
+                // If we change (and therefore reset) our timer right before it fires,
+                // we could potentially miss one collection event.
+                // Therefore fetch all remotes which would be due for metric collection before
+                // the new timer fires.
+                self.fetch_overdue_and_save_state(next_run).await;
             }
         }
     }
@@ -208,12 +221,12 @@ impl MetricCollectionTask {
     /// Set up a [`tokio::time::Interval`] instance with the provided interval.
     /// The timer will be aligned, e.g. an interval of `60` will let the timer
     /// fire at minute boundaries.
-    fn setup_timer(interval: u64) -> Interval {
+    fn setup_timer(interval: u64) -> (Interval, Instant) {
         let mut timer = tokio::time::interval(Duration::from_secs(interval));
-        let first_run = task_utils::next_aligned_instant(interval).into();
-        timer.reset_at(first_run);
+        let first_run = task_utils::next_aligned_instant(interval);
+        timer.reset_at(first_run.into());
 
-        timer
+        (timer, first_run)
     }
 
     /// Convenience helper to load `remote.cfg`, logging the error
@@ -292,6 +305,51 @@ impl MetricCollectionTask {
         }
     }
 
+    /// Fetch metric data from remotes which are overdue for collection and save
+    /// collection state.
+    async fn fetch_overdue_and_save_state(&mut self, next_run: Instant) {
+        if let Some(remotes) = Self::load_remote_config() {
+            self.fetch_overdue(&remotes, next_run).await;
+            if let Err(e) = self.state.save() {
+                log::error!("could not update metric collection state: {e}");
+            }
+        }
+    }
+
+    /// Fetch metric data from remotes which are overdue for collection.
+    ///
+    /// Use this on startup of the metric collection loop as well as
+    /// when the collection interval changes.
+    ///
+    /// Does nothing if the remote config could not be read; in this
+    /// case, an error is logged.
+    async fn fetch_overdue(&mut self, remotes: &SectionConfigData<Remote>, next_run: Instant) {
+        let left_until_scheduled = next_run - Instant::now();
+        let now = proxmox_time::epoch_i64();
+
+        let mut overdue = Vec::new();
+
+        for remote in &remotes.order {
+            let last_collection = self
+                .state
+                .get_status(remote)
+                .and_then(|s| s.last_collection)
+                .unwrap_or(0);
+
+            let diff = now - last_collection;
+
+            if diff + left_until_scheduled.as_secs() as i64
+                > self.settings.collection_interval_or_default() as i64
+            {
+                log::debug!(
+                    "starting metric collection for remote '{remote}' - triggered because collection is overdue"
+                );
+                overdue.push(remote.clone());
+            }
+        }
+        self.fetch_remotes(remotes, &overdue).await;
+    }
+
     /// Fetch a single remote.
     #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
     async fn fetch_single_remote(
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 11/25] metric collection: add tests for the fetch_remotes function
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (9 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/25] metric collection: add test for fetch_overdue Lukas Wagner
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

Now that we can mock the remote clients and the remote config, and
since the metric collection and RRD tasks are decoupled, it is viable
to write a unit test for one of the core pieces of the metric
collection system.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 232 ++++++++++++++++++
 server/src/metric_collection/state.rs         |  16 +-
 2 files changed, 236 insertions(+), 12 deletions(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f0742ea..b07a933 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -433,3 +433,235 @@ pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
         file_options,
     ))
 }
+
+#[cfg(test)]
+pub(super) mod tests {
+    use std::{collections::HashMap, sync::Once};
+
+    use anyhow::bail;
+    use http::StatusCode;
+
+    use pdm_api_types::Authid;
+    use pve_api_types::{client::PveClient, ClusterMetrics, ClusterMetricsData};
+
+    use crate::{
+        connection::ClientFactory, metric_collection::rrd_task::RrdStoreResult,
+        pbs_client::PbsClient, test_support::temp::NamedTempFile,
+    };
+
+    use super::*;
+
+    pub(crate) fn get_create_options() -> CreateOptions {
+        CreateOptions::new()
+            .owner(nix::unistd::Uid::effective())
+            .group(nix::unistd::Gid::effective())
+            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+    }
+
+    struct TestClientFactory {
+        now: i64,
+    }
+
+    #[async_trait::async_trait]
+    impl ClientFactory for TestClientFactory {
+        fn make_pve_client(
+            &self,
+            remote: &Remote,
+        ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+            Ok(Box::new(TestPveClient {
+                fail: remote.id.contains("fail"),
+                now: self.now,
+            }))
+        }
+        /// Create a new API client for PVE remotes, but with a specific endpoint.
+        fn make_pve_client_with_endpoint(
+            &self,
+            _remote: &Remote,
+            _target_endpoint: Option<&str>,
+        ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+            bail!("not implemented")
+        }
+
+        fn make_pbs_client(&self, _remote: &Remote) -> Result<Box<PbsClient>, Error> {
+            bail!("not implemented")
+        }
+
+        async fn make_pve_client_and_login(
+            &self,
+            _remote: &Remote,
+        ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+            bail!("not implemented")
+        }
+
+        async fn make_pbs_client_and_login(
+            &self,
+            _remote: &Remote,
+        ) -> Result<Box<PbsClient>, Error> {
+            bail!("not implemented")
+        }
+    }
+
+    struct TestPveClient {
+        now: i64,
+        fail: bool,
+    }
+
+    #[async_trait::async_trait]
+    impl PveClient for TestPveClient {
+        /// Retrieve metrics of the cluster.
+        async fn cluster_metrics_export(
+            &self,
+            _history: Option<bool>,
+            _local_only: Option<bool>,
+            start_time: Option<i64>,
+        ) -> Result<ClusterMetrics, proxmox_client::Error> {
+            if self.fail {
+                return Err(proxmox_client::Error::Api(
+                    StatusCode::INTERNAL_SERVER_ERROR,
+                    "internal server error".into(),
+                ));
+            }
+
+            let mut time = start_time.unwrap_or(0);
+            time = time.max(self.now - (30 * 60));
+            let mut data = Vec::new();
+
+            use pve_api_types::ClusterMetricsDataType::*;
+
+            while time < self.now {
+                let point = |id: &str, metric: &str, timestamp, ty| ClusterMetricsData {
+                    id: id.into(),
+                    metric: metric.into(),
+                    timestamp,
+                    ty,
+                    value: 0.1,
+                };
+
+                for i in 0..5 {
+                    let id = format!("node/node-{i}");
+                    data.push(point(&id, "cpu_current", time, Gauge));
+                }
+
+                // Advance time by 10 seconds
+                time += 10;
+            }
+
+            Ok(ClusterMetrics { data })
+        }
+    }
+
+    fn make_remote_config() -> SectionConfigData<Remote> {
+        let mut order = Vec::new();
+        let mut sections = HashMap::new();
+
+        for i in 0..4 {
+            let status = if i >= 2 { "fail" } else { "pass" };
+            let name = format!("pve-{i}-{status}");
+
+            sections.insert(
+                name.clone(),
+                Remote {
+                    ty: pdm_api_types::remotes::RemoteType::Pve,
+                    id: name.clone(),
+                    nodes: Vec::new(),
+                    authid: Authid::root_auth_id().clone(),
+                    token: "".into(),
+                    web_url: None,
+                },
+            );
+
+            order.push(name);
+        }
+
+        SectionConfigData { sections, order }
+    }
+
+    async fn fake_rrd_task(mut rx: Receiver<RrdStoreRequest>) -> u32 {
+        let mut number_of_requests = 0;
+
+        while let Some(request) = rx.recv().await {
+            number_of_requests += 1;
+
+            match request {
+                RrdStoreRequest::Pve {
+                    remote: _,
+                    metrics,
+                    channel,
+                } => {
+                    let most_recent_timestamp =
+                        metrics.data.iter().fold(0, |acc, e| acc.max(e.timestamp));
+
+                    let _ = channel.send(RrdStoreResult {
+                        most_recent_timestamp,
+                    });
+                }
+                _ => {}
+            }
+        }
+
+        number_of_requests
+    }
+
+    static START: Once = Once::new();
+
+    fn test_init() -> i64 {
+        let now = 10000;
+        START.call_once(|| {
+            // TODO: the client factory is currently stored in a OnceLock -
+            // we can only set it from one test... Ideally we'd like to have the
+            // option to set it in every single test if needed - task/thread local?
+            connection::init(Box::new(TestClientFactory { now }));
+        });
+
+        now
+    }
+
+    #[tokio::test]
+    async fn test_fetch_remotes_updates_state() {
+        // Arrange
+        let now = test_init();
+
+        let (tx, rx) = tokio::sync::mpsc::channel(10);
+        let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+        let settings = CollectionSettings::default();
+        let config = make_remote_config();
+
+        let state_file = NamedTempFile::new(get_create_options()).unwrap();
+        let state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+        let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+        let mut task = MetricCollectionTask {
+            state,
+            settings,
+            metric_data_tx: tx,
+            control_message_rx: control_rx,
+        };
+
+        // Act
+        task.fetch_remotes(&config, &config.order).await;
+
+        // Assert
+        for remote in config.order {
+            let status = task.state.get_status(&remote).unwrap();
+
+            // Our faked PVE client will return an error if the remote name contains
+            // 'fail'.
+            if remote.contains("fail") {
+                assert!(status
+                    .error
+                    .as_ref()
+                    .unwrap()
+                    .contains("internal server error"));
+                assert_eq!(status.last_collection, None);
+            } else {
+                assert!(now - status.most_recent_datapoint <= 10);
+                assert!(status.error.is_none());
+            }
+        }
+
+        drop(task);
+        assert_eq!(handle.await.unwrap(), 2);
+    }
+}
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index 4e88a45..a9f712d 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -89,23 +89,15 @@ impl MetricCollectionState {
 
 #[cfg(test)]
 mod tests {
-    use proxmox_sys::fs::CreateOptions;
-
-    use super::*;
-
+    use crate::metric_collection::collection_task::tests::get_create_options;
     use crate::test_support::temp::NamedTempFile;
 
-    fn get_options() -> CreateOptions {
-        CreateOptions::new()
-            .owner(nix::unistd::Uid::effective())
-            .group(nix::unistd::Gid::effective())
-            .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
-    }
+    use super::*;
 
     #[test]
     fn save_and_load() -> Result<(), Error> {
-        let file = NamedTempFile::new(get_options())?;
-        let options = get_options();
+        let file = NamedTempFile::new(get_create_options())?;
+        let options = get_create_options();
         let mut state = MetricCollectionState::new(file.path().into(), options.clone());
 
         state.set_status(
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 12/25] metric collection: add test for fetch_overdue
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (10 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/25] metric collection: add tests for the fetch_remotes function Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/25] metric collection: pass rrd cache instance as function parameter Lukas Wagner
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This test ensures that the logic for fetching metrics from remotes
which are overdue for collection continues to work as expected.
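
The arithmetic behind the two fixtures in this test can be checked in isolation. With a 60-second interval and 30 seconds left until the next tick, only the remote whose last collection lies 35 seconds in the past crosses the threshold (a standalone sketch; the constants mirror the test below):

```rust
fn main() {
    let interval: i64 = 60; // collection interval used by the test
    let until_next_tick: i64 = 30; // next scheduled collection in 30s

    // "pve-0-pass", collected 35s ago: 35 + 30 = 65 > 60 -> overdue, fetched
    assert!(35 + until_next_tick > interval);

    // "pve-1-pass", collected 25s ago: 25 + 30 = 55 <= 60 -> not fetched,
    // it will still be covered by the next regular timer tick
    assert!(!(25 + until_next_tick > interval));
}
```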

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index b07a933..f349056 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -664,4 +664,66 @@ pub(super) mod tests {
         drop(task);
         assert_eq!(handle.await.unwrap(), 2);
     }
+
+    #[tokio::test]
+    async fn test_fetch_overdue() {
+        // Arrange
+        test_init();
+
+        let (tx, rx) = tokio::sync::mpsc::channel(10);
+        let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+        let settings = CollectionSettings {
+            collection_interval: Some(60),
+            ..Default::default()
+        };
+
+        let config = make_remote_config();
+
+        let state_file = NamedTempFile::new(get_create_options()).unwrap();
+        let mut state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+        let now = proxmox_time::epoch_i64();
+
+        // This one should be fetched
+        state.set_status(
+            "pve-0-pass".into(),
+            RemoteStatus {
+                last_collection: Some(now - 35),
+                ..Default::default()
+            },
+        );
+        // This one should *not* be fetched
+        state.set_status(
+            "pve-1-pass".into(),
+            RemoteStatus {
+                last_collection: Some(now - 25),
+                ..Default::default()
+            },
+        );
+
+        let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+        let mut task = MetricCollectionTask {
+            state,
+            settings,
+            metric_data_tx: tx,
+            control_message_rx: control_rx,
+        };
+
+        let next_collection = Instant::now() + Duration::from_secs(30);
+
+        // Act
+        task.fetch_overdue(&config, next_collection).await;
+
+        // Assert
+        let status = task.state.get_status("pve-0-pass").unwrap();
+        assert!(status.last_collection.unwrap() - now >= 0);
+
+        let status = task.state.get_status("pve-1-pass").unwrap();
+        assert_eq!(status.last_collection.unwrap(), now - 25);
+
+        drop(task);
+        assert_eq!(handle.await.unwrap(), 1);
+    }
 }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 13/25] metric collection: pass rrd cache instance as function parameter
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (11 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/25] metric collection: add test for fetch_overdue Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/25] metric collection: add test for rrd task Lukas Wagner
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This enables us to do dependency injection for tests by passing in the
rrd cache to use, instead of having to replace the global static
instance. The latter is always awkward because tests might run
multi-threaded, so the global instance could/should be a thread-local.
That again is weird in the tokio world, where we actually want
task-local variables, which in turn are weird once you start new tasks,
since those don't access the same task-local variables as their parent
task. Long story short: passing in the dependency as a parameter makes
things easier.
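
The pattern reduces to a small sketch. The `Cache` type below is a placeholder stand-in, not the real `proxmox_rrd::Cache` API; only the injection pattern itself is the point:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

// Placeholder stand-in for `proxmox_rrd::Cache`; the real type lives in
// the proxmox_rrd crate and has a different API.
struct Cache {
    updates: AtomicU64,
}

impl Cache {
    fn update_value(&self) {
        self.updates.fetch_add(1, Ordering::Relaxed);
    }
}

// The cache is handed in as a parameter instead of being fetched from a
// global static. Wrapping it in `Arc` gives spawned tasks a cheap
// 'static handle via refcounting.
fn store_metrics(cache: &Cache, values: &[f64]) {
    for _value in values {
        cache.update_value();
    }
}

fn main() {
    // A test can construct its own private instance; no global state.
    let cache = Arc::new(Cache {
        updates: AtomicU64::new(0),
    });
    store_metrics(&cache, &[0.1, 0.2, 0.3]);
    assert_eq!(cache.updates.load(Ordering::Relaxed), 3);
}
```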

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/rrd_common.rs                 |  4 +-
 server/src/metric_collection/mod.rs          | 15 +++-
 server/src/metric_collection/rrd_cache.rs    | 77 ++++++++++++--------
 server/src/metric_collection/rrd_task.rs     | 16 ++--
 server/src/metric_collection/top_entities.rs |  3 +
 5 files changed, 75 insertions(+), 40 deletions(-)

diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index 0d82d0c..d9ed017 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -23,9 +23,11 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
     let mut timemap = BTreeMap::new();
     let mut last_resolution = None;
 
+    let cache = rrd_cache::get_cache();
+
     for name in T::fields() {
         let (start, resolution, data) =
-            match rrd_cache::extract_data(basedir, name, timeframe, mode)? {
+            match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
                 Some(data) => data.into(),
                 None => continue,
             };
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index cf3350b..e41f0c6 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -4,6 +4,8 @@ use std::sync::OnceLock;
 use anyhow::{bail, Error};
 use tokio::sync::mpsc::{self, Sender};
 
+use proxmox_sys::fs::CreateOptions;
+
 mod collection_task;
 pub mod rrd_cache;
 mod rrd_task;
@@ -16,7 +18,14 @@ static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
 
 /// Initialize the RRD cache
 pub fn init() -> Result<(), Error> {
-    rrd_cache::init()?;
+    let api_uid = pdm_config::api_user()?.uid;
+    let api_gid = pdm_config::api_group()?.gid;
+
+    let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+    let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+    let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+    rrd_cache::set_cache(cache)?;
 
     Ok(())
 }
@@ -41,8 +50,10 @@ pub fn start_task() -> Result<(), Error> {
         futures::future::select(metric_collection_task_future, abort_future).await;
     });
 
+    let cache = rrd_cache::get_cache();
+
     tokio::spawn(async move {
-        let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+        let rrd_task_future = pin!(rrd_task::store_in_rrd_task(cache, metric_data_rx));
         let abort_future = pin!(proxmox_daemon::shutdown_future());
         futures::future::select(rrd_task_future, abort_future).await;
     });
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index a8d72b8..424d476 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -5,6 +5,7 @@
 //! and update RRD data inside `proxmox-datacenter-api`.
 
 use std::path::Path;
+use std::sync::Arc;
 
 use anyhow::{format_err, Error};
 use once_cell::sync::OnceCell;
@@ -16,32 +17,45 @@ use proxmox_sys::fs::CreateOptions;
 
 use pdm_buildcfg::PDM_STATE_DIR_M;
 
-const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
+pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
 
-static RRD_CACHE: OnceCell<Cache> = OnceCell::new();
+// This is an `Arc` because this makes it easier to do dependency injection
+// in test contexts.
+//
+// For DI in testing, we want to pass in a reference to the Cache
+// as a function parameter. In a couple of these functions we
+// spawn tokio tasks which need access to the reference, hence the
+// reference needs to be 'static. In this context it is hard to
+// come up with a 'static reference, so we just wrap the cache
+// in an `Arc` for now, solving the lifetime problem via
+// refcounting.
+static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
 
 /// Get the RRD cache instance
-fn get_cache() -> Result<&'static Cache, Error> {
+pub fn get_cache() -> Arc<Cache> {
+    RRD_CACHE.get().cloned().expect("rrd cache not initialized")
+}
+
+pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
     RRD_CACHE
-        .get()
-        .ok_or_else(|| format_err!("RRD cache not initialized!"))
+        .set(cache)
+        .map_err(|_| format_err!("RRD cache already initialized!"))?;
+
+    Ok(())
 }
 
 /// Initialize the RRD cache instance
 ///
 /// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init() -> Result<&'static Cache, Error> {
-    let api_uid = pdm_config::api_user()?.uid;
-    let api_gid = pdm_config::api_group()?.gid;
-
-    let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
-    let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
+pub fn init<P: AsRef<Path>>(
+    base_path: P,
+    dir_options: CreateOptions,
+    file_options: CreateOptions,
+) -> Result<Arc<Cache>, Error> {
     let apply_interval = 30.0 * 60.0; // 30 minutes
 
     let cache = Cache::new(
-        RRD_CACHE_BASEDIR,
+        base_path,
         Some(file_options),
         Some(dir_options),
         apply_interval,
@@ -51,11 +65,7 @@ pub fn init() -> Result<&'static Cache, Error> {
 
     cache.apply_journal()?;
 
-    RRD_CACHE
-        .set(cache)
-        .map_err(|_| format_err!("RRD cache already initialized!"))?;
-
-    Ok(RRD_CACHE.get().unwrap())
+    Ok(Arc::new(cache))
 }
 
 fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
@@ -91,6 +101,7 @@ fn create_callback(dst: DataSourceType) -> Database {
 
 /// Extracts data for the specified time frame from RRD cache
 pub fn extract_data(
+    rrd_cache: &Cache,
     basedir: &str,
     name: &str,
     timeframe: RrdTimeframe,
@@ -112,26 +123,28 @@ pub fn extract_data(
         RrdMode::Average => AggregationFn::Average,
     };
 
-    let rrd_cache = get_cache()?;
-
     rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
 }
 
 /// Sync/Flush the RRD journal
 pub fn sync_journal() {
-    if let Ok(rrd_cache) = get_cache() {
-        if let Err(err) = rrd_cache.sync_journal() {
-            log::error!("rrd_sync_journal failed - {err}");
-        }
+    let rrd_cache = get_cache();
+    if let Err(err) = rrd_cache.sync_journal() {
+        log::error!("rrd_sync_journal failed - {err}");
     }
 }
+
 /// Update RRD Gauge values
-pub fn update_value(name: &str, value: f64, timestamp: i64, datasource_type: DataSourceType) {
-    if let Ok(rrd_cache) = get_cache() {
-        if let Err(err) =
-            rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
-        {
-            log::error!("rrd::update_value '{name}' failed - {err}");
-        }
+pub fn update_value(
+    rrd_cache: &Cache,
+    name: &str,
+    value: f64,
+    timestamp: i64,
+    datasource_type: DataSourceType,
+) {
+    if let Err(err) =
+        rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+    {
+        log::error!("rrd::update_value '{name}' failed - {err}");
     }
 }
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 7ee56c5..a678bb5 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,7 +1,9 @@
+use std::sync::Arc;
+
 use anyhow::Error;
 use tokio::sync::{mpsc::Receiver, oneshot};
 
-use proxmox_rrd::rrd::DataSourceType;
+use proxmox_rrd::{rrd::DataSourceType, Cache};
 
 use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
 use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
@@ -39,9 +41,11 @@ pub(super) struct RrdStoreResult {
 /// Task which stores received metrics in the RRD. Metric data is fed into
 /// this task via a MPSC channel.
 pub(super) async fn store_in_rrd_task(
+    cache: Arc<Cache>,
     mut receiver: Receiver<RrdStoreRequest>,
 ) -> Result<(), Error> {
     while let Some(msg) = receiver.recv().await {
+        let cache_clone = Arc::clone(&cache);
         //// Involves some blocking file IO
         tokio::task::spawn_blocking(move || {
             let mut most_recent_timestamp = 0;
@@ -53,7 +57,7 @@ pub(super) async fn store_in_rrd_task(
                 } => {
                     for data_point in metrics.data {
                         most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
-                        store_metric_pve(&remote, &data_point);
+                        store_metric_pve(&cache_clone, &remote, &data_point);
                     }
 
                     channel
@@ -65,7 +69,7 @@ pub(super) async fn store_in_rrd_task(
                 } => {
                     for data_point in metrics.data {
                         most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
-                        store_metric_pbs(&remote, &data_point);
+                        store_metric_pbs(&cache_clone, &remote, &data_point);
                     }
 
                     channel
@@ -84,7 +88,7 @@ pub(super) async fn store_in_rrd_task(
     Ok(())
 }
 
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
     let name = format!(
         "pve/{remote_name}/{id}/{metric}",
         id = data_point.id,
@@ -98,6 +102,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
     };
 
     rrd_cache::update_value(
+        cache,
         &name,
         data_point.value,
         data_point.timestamp,
@@ -105,7 +110,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
     );
 }
 
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
     let name = format!(
         "pbs/{remote_name}/{id}/{metric}",
         id = data_point.id,
@@ -119,6 +124,7 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
     };
 
     rrd_cache::update_value(
+        cache,
         &name,
         data_point.value,
         data_point.timestamp,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f8e053f..f54ee72 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -121,7 +121,10 @@ fn get_entity(
     name: String,
     metric: &str,
 ) -> Option<(f64, TopEntity)> {
+    let cache = rrd_cache::get_cache();
+
     if let Ok(Some(values)) = rrd_cache::extract_data(
+        &cache,
         &name,
         metric,
         timeframe,
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 14/25] metric collection: add test for rrd task
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (12 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/25] metric collection: pass rrd cache instance as function parameter Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/25] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This commit adds a quick smoke test for the `store_in_rrd_task`
function. Unfortunately, due to a race condition in proxmox_rrd,
some extra measures had to be taken to keep the test from being flaky.
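The anti-flakiness measure can be sketched in isolation. Note that `racy_read` below is a made-up stand-in for `rrd_cache::extract_data`, not the actual proxmox_rrd API: a read directly after a write may racily return `None`, so the test asserts on the data only when it is present instead of unwrapping.

```rust
// Stand-in for a read that may racily return nothing right after a
// write (illustrative only, not the real proxmox_rrd interface).
fn racy_read(data_ready: bool) -> Option<Vec<Option<f64>>> {
    if data_ready {
        // Exact positions in the vec may vary with archive boundaries,
        // so "some data point exists" is the only stable property.
        Some(vec![None, Some(0.1), Some(0.2)])
    } else {
        None
    }
}

fn main() {
    for data_ready in [true, false] {
        if let Some(data) = racy_read(data_ready) {
            // Only assert that some data point exists.
            assert!(data.iter().any(Option::is_some));
        }
        // A `None` result is tolerated instead of failing the test.
    }
    println!("ok");
}
```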

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/metric_collection/rrd_task.rs | 93 ++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index a678bb5..e5abb2d 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -131,3 +131,96 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
         data_source_type,
     );
 }
+
+#[cfg(test)]
+mod tests {
+    use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
+    use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+    use crate::{
+        metric_collection::collection_task::tests::get_create_options,
+        test_support::temp::NamedTempDir,
+    };
+
+    use super::*;
+
+    #[tokio::test]
+    async fn test_rrd_task_persists_data() -> Result<(), Error> {
+        // Arrange
+        let dir = NamedTempDir::new()?;
+        let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
+        let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+
+        let (tx, rx) = tokio::sync::mpsc::channel(10);
+        let task = store_in_rrd_task(Arc::clone(&cache), rx);
+        let handle = tokio::task::spawn(task);
+
+        let now = proxmox_time::epoch_i64();
+
+        let metrics = ClusterMetrics {
+            data: vec![
+                ClusterMetricsData {
+                    id: "node/some-node".into(),
+                    metric: "cpu_current".into(),
+                    timestamp: now - 30,
+                    ty: ClusterMetricsDataType::Gauge,
+                    value: 0.1,
+                },
+                ClusterMetricsData {
+                    id: "node/some-node".into(),
+                    metric: "cpu_current".into(),
+                    timestamp: now - 20,
+                    ty: ClusterMetricsDataType::Gauge,
+                    value: 0.2,
+                },
+                ClusterMetricsData {
+                    id: "node/some-node".into(),
+                    metric: "cpu_current".into(),
+                    timestamp: now - 10,
+                    ty: ClusterMetricsDataType::Gauge,
+                    value: 0.1,
+                },
+                ClusterMetricsData {
+                    id: "node/some-node".into(),
+                    metric: "cpu_current".into(),
+                    timestamp: now,
+                    ty: ClusterMetricsDataType::Gauge,
+                    value: 0.2,
+                },
+            ],
+        };
+        let (tx_back, rx_back) = tokio::sync::oneshot::channel();
+        let request = RrdStoreRequest::Pve {
+            remote: "some-remote".into(),
+            metrics,
+            channel: tx_back,
+        };
+
+        // Act
+        tx.send(request).await?;
+        let result = rx_back.await?;
+
+        // Assert
+        assert_eq!(result.most_recent_timestamp, now);
+
+        drop(tx);
+        handle.await??;
+
+        // There is some race condition in proxmox_rrd; in some rare cases
+        // extract_data does not return any data directly after writing.
+        if let Some(data) = rrd_cache::extract_data(
+            &cache,
+            "pve/some-remote/node/some-node",
+            "cpu_current",
+            RrdTimeframe::Hour,
+            RrdMode::Max,
+        )? {
+            // Only assert that there are some data points, the
+            // exact position in the vec might vary due to
+            // changed boundaries
+            assert!(data.data.iter().any(Option::is_some));
+        }
+
+        Ok(())
+    }
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 15/25] metric collection: wrap rrd_cache::Cache in a struct
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (13 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/25] metric collection: add test for rrd task Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 16/25] metric collection: record remote response time in metric database Lukas Wagner
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

All helper functions for storing and retrieving metric data take the
cache instance as their first parameter. This smells like they should be
methods on a struct.

This commit wraps the foreign `proxmox_rrd::Cache` type in a new type,
turning the `init` function into a `new` method and all other helpers
into further methods.
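The refactoring pattern in miniature, with made-up stand-in types rather than the actual PDM/proxmox_rrd definitions: free functions taking the cache as their first argument become methods on a newtype that wraps the foreign cache.

```rust
use std::collections::HashMap;

// Stand-in for the foreign cache type (proxmox_rrd::Cache in the
// series); here just a map from metric name to last value.
struct Cache {
    values: HashMap<String, f64>,
}

// The newtype wrapper: helpers move from free functions onto methods.
struct RrdCacheSketch {
    cache: Cache,
}

impl RrdCacheSketch {
    // Former free `init` function, now a constructor.
    fn new() -> Self {
        Self { cache: Cache { values: HashMap::new() } }
    }

    // Former free function `update_value(cache, name, value)`.
    fn update_value(&mut self, name: &str, value: f64) {
        self.cache.values.insert(name.to_string(), value);
    }

    // Former free function `extract_data(cache, name)`.
    fn extract_data(&self, name: &str) -> Option<f64> {
        self.cache.values.get(name).copied()
    }
}

fn main() {
    let mut cache = RrdCacheSketch::new();
    cache.update_value("pve/some-remote/node/some-node/cpu_current", 0.25);
    assert_eq!(
        cache.extract_data("pve/some-remote/node/some-node/cpu_current"),
        Some(0.25)
    );
    println!("ok");
}
```

Callers then go from `rrd_cache::extract_data(&cache, ...)` to `cache.extract_data(...)`, as the diff below shows.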

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/rrd_common.rs                 |   9 +-
 server/src/metric_collection/mod.rs          |   6 +-
 server/src/metric_collection/rrd_cache.rs    | 191 +++++++++----------
 server/src/metric_collection/rrd_task.rs     |  21 +-
 server/src/metric_collection/top_entities.rs |   3 +-
 5 files changed, 113 insertions(+), 117 deletions(-)

diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index d9ed017..28868bc 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -26,11 +26,10 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
     let cache = rrd_cache::get_cache();
 
     for name in T::fields() {
-        let (start, resolution, data) =
-            match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
-                Some(data) => data.into(),
-                None => continue,
-            };
+        let (start, resolution, data) = match cache.extract_data(basedir, name, timeframe, mode)? {
+            Some(data) => data.into(),
+            None => continue,
+        };
 
         if let Some(expected_resolution) = last_resolution {
             if resolution != expected_resolution {
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index e41f0c6..6551f2b 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,4 +1,5 @@
 use std::pin::pin;
+use std::sync::Arc;
 use std::sync::OnceLock;
 
 use anyhow::{bail, Error};
@@ -13,6 +14,7 @@ mod state;
 pub mod top_entities;
 
 use collection_task::{ControlMsg, MetricCollectionTask};
+use rrd_cache::RrdCache;
 
 static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
 
@@ -24,8 +26,8 @@ pub fn init() -> Result<(), Error> {
     let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
     let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
 
-    let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
-    rrd_cache::set_cache(cache)?;
+    let cache = RrdCache::new(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+    rrd_cache::set_cache(Arc::new(cache))?;
 
     Ok(())
 }
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index 424d476..5e43aaa 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -29,14 +29,14 @@ pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
 // hard time to come up if a 'static reference, so we just
 // wrap the cache in an `Arc` for now, solving the
 // lifetime problem via refcounting.
-static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
+static RRD_CACHE: OnceCell<Arc<RrdCache>> = OnceCell::new();
 
 /// Get the RRD cache instance
-pub fn get_cache() -> Arc<Cache> {
+pub fn get_cache() -> Arc<RrdCache> {
     RRD_CACHE.get().cloned().expect("rrd cache not initialized")
 }
 
-pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
+pub fn set_cache(cache: Arc<RrdCache>) -> Result<(), Error> {
     RRD_CACHE
         .set(cache)
         .map_err(|_| format_err!("RRD cache already initialized!"))?;
@@ -44,107 +44,106 @@ pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
     Ok(())
 }
 
-/// Initialize the RRD cache instance
-///
-/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init<P: AsRef<Path>>(
-    base_path: P,
-    dir_options: CreateOptions,
-    file_options: CreateOptions,
-) -> Result<Arc<Cache>, Error> {
-    let apply_interval = 30.0 * 60.0; // 30 minutes
-
-    let cache = Cache::new(
-        base_path,
-        Some(file_options),
-        Some(dir_options),
-        apply_interval,
-        load_callback,
-        create_callback,
-    )?;
-
-    cache.apply_journal()?;
-
-    Ok(Arc::new(cache))
+/// Wrapper for proxmox_rrd::Cache to accommodate helper methods.
+pub struct RrdCache {
+    cache: Cache,
 }
 
-fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
-    match Database::load(path, true) {
-        Ok(rrd) => Some(rrd),
-        Err(err) => {
-            if err.kind() != std::io::ErrorKind::NotFound {
-                log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+impl RrdCache {
+    /// Create a new RrdCache instance
+    pub fn new<P: AsRef<Path>>(
+        base_path: P,
+        dir_options: CreateOptions,
+        file_options: CreateOptions,
+    ) -> Result<Self, Error> {
+        let apply_interval = 30.0 * 60.0; // 30 minutes
+
+        let cache = Cache::new(
+            base_path,
+            Some(file_options),
+            Some(dir_options),
+            apply_interval,
+            Self::load_callback,
+            Self::create_callback,
+        )?;
+
+        cache.apply_journal()?;
+
+        Ok(Self { cache })
+    }
+
+    fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
+        match Database::load(path, true) {
+            Ok(rrd) => Some(rrd),
+            Err(err) => {
+                if err.kind() != std::io::ErrorKind::NotFound {
+                    log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+                }
+                None
             }
-            None
         }
     }
-}
-
-fn create_callback(dst: DataSourceType) -> Database {
-    let rra_list = vec![
-        // 1 min * 1440 => 1 day
-        Archive::new(AggregationFn::Average, 60, 1440),
-        Archive::new(AggregationFn::Maximum, 60, 1440),
-        // 30 min * 1440 => 30 days ~ 1 month
-        Archive::new(AggregationFn::Average, 30 * 60, 1440),
-        Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
-        // 6 h * 1440 => 360 days ~ 1 year
-        Archive::new(AggregationFn::Average, 6 * 3600, 1440),
-        Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
-        // 1 week * 570 => 10 years
-        Archive::new(AggregationFn::Average, 7 * 86400, 570),
-        Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
-    ];
-
-    Database::new(dst, rra_list)
-}
 
-/// Extracts data for the specified time frame from RRD cache
-pub fn extract_data(
-    rrd_cache: &Cache,
-    basedir: &str,
-    name: &str,
-    timeframe: RrdTimeframe,
-    mode: RrdMode,
-) -> Result<Option<proxmox_rrd::Entry>, Error> {
-    let end = proxmox_time::epoch_f64() as u64;
-
-    let (start, resolution) = match timeframe {
-        RrdTimeframe::Hour => (end - 3600, 60),
-        RrdTimeframe::Day => (end - 3600 * 24, 60),
-        RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
-        RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
-        RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
-        RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
-    };
-
-    let cf = match mode {
-        RrdMode::Max => AggregationFn::Maximum,
-        RrdMode::Average => AggregationFn::Average,
-    };
-
-    rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
-}
+    fn create_callback(dst: DataSourceType) -> Database {
+        let rra_list = vec![
+            // 1 min * 1440 => 1 day
+            Archive::new(AggregationFn::Average, 60, 1440),
+            Archive::new(AggregationFn::Maximum, 60, 1440),
+            // 30 min * 1440 => 30 days ~ 1 month
+            Archive::new(AggregationFn::Average, 30 * 60, 1440),
+            Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
+            // 6 h * 1440 => 360 days ~ 1 year
+            Archive::new(AggregationFn::Average, 6 * 3600, 1440),
+            Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
+            // 1 week * 570 => 10 years
+            Archive::new(AggregationFn::Average, 7 * 86400, 570),
+            Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
+        ];
+
+        Database::new(dst, rra_list)
+    }
 
-/// Sync/Flush the RRD journal
-pub fn sync_journal() {
-    let rrd_cache = get_cache();
-    if let Err(err) = rrd_cache.sync_journal() {
-        log::error!("rrd_sync_journal failed - {err}");
+    /// Extracts data for the specified time frame from RRD cache
+    pub fn extract_data(
+        &self,
+        basedir: &str,
+        name: &str,
+        timeframe: RrdTimeframe,
+        mode: RrdMode,
+    ) -> Result<Option<proxmox_rrd::Entry>, Error> {
+        let end = proxmox_time::epoch_f64() as u64;
+
+        let (start, resolution) = match timeframe {
+            RrdTimeframe::Hour => (end - 3600, 60),
+            RrdTimeframe::Day => (end - 3600 * 24, 60),
+            RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
+            RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
+            RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
+            RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
+        };
+
+        let cf = match mode {
+            RrdMode::Max => AggregationFn::Maximum,
+            RrdMode::Average => AggregationFn::Average,
+        };
+
+        self.cache
+            .extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
     }
-}
 
-/// Update RRD Gauge values
-pub fn update_value(
-    rrd_cache: &Cache,
-    name: &str,
-    value: f64,
-    timestamp: i64,
-    datasource_type: DataSourceType,
-) {
-    if let Err(err) =
-        rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
-    {
-        log::error!("rrd::update_value '{name}' failed - {err}");
+    /// Update RRD Gauge values
+    pub fn update_value(
+        &self,
+        name: &str,
+        value: f64,
+        timestamp: i64,
+        datasource_type: DataSourceType,
+    ) {
+        if let Err(err) =
+            self.cache
+                .update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+        {
+            log::error!("rrd::update_value '{name}' failed - {err}");
+        }
     }
 }
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index e5abb2d..89fe4d3 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -3,12 +3,12 @@ use std::sync::Arc;
 use anyhow::Error;
 use tokio::sync::{mpsc::Receiver, oneshot};
 
-use proxmox_rrd::{rrd::DataSourceType, Cache};
+use proxmox_rrd::rrd::DataSourceType;
 
 use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
 use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
 
-use super::rrd_cache;
+use super::rrd_cache::RrdCache;
 
 /// Store request for the RRD task.
 pub(super) enum RrdStoreRequest {
@@ -41,7 +41,7 @@ pub(super) struct RrdStoreResult {
 /// Task which stores received metrics in the RRD. Metric data is fed into
 /// this task via a MPSC channel.
 pub(super) async fn store_in_rrd_task(
-    cache: Arc<Cache>,
+    cache: Arc<RrdCache>,
     mut receiver: Receiver<RrdStoreRequest>,
 ) -> Result<(), Error> {
     while let Some(msg) = receiver.recv().await {
@@ -88,7 +88,7 @@ pub(super) async fn store_in_rrd_task(
     Ok(())
 }
 
-fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &RrdCache, remote_name: &str, data_point: &ClusterMetricsData) {
     let name = format!(
         "pve/{remote_name}/{id}/{metric}",
         id = data_point.id,
@@ -101,8 +101,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
         ClusterMetricsDataType::Derive => DataSourceType::Derive,
     };
 
-    rrd_cache::update_value(
-        cache,
+    cache.update_value(
         &name,
         data_point.value,
         data_point.timestamp,
@@ -110,7 +109,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
     );
 }
 
-fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricDataPoint) {
     let name = format!(
         "pbs/{remote_name}/{id}/{metric}",
         id = data_point.id,
@@ -123,8 +122,7 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
         MetricDataType::Derive => DataSourceType::Derive,
     };
 
-    rrd_cache::update_value(
-        cache,
+    cache.update_value(
         &name,
         data_point.value,
         data_point.timestamp,
@@ -149,7 +147,7 @@ mod tests {
         // Arrange
         let dir = NamedTempDir::new()?;
         let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
-        let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+        let cache = Arc::new(RrdCache::new(dir.path(), options.clone(), options.clone())?);
 
         let (tx, rx) = tokio::sync::mpsc::channel(10);
         let task = store_in_rrd_task(Arc::clone(&cache), rx);
@@ -208,8 +206,7 @@ mod tests {
 
         // There is some race condition in proxmox_rrd; in some rare cases
         // extract_data does not return any data directly after writing.
-        if let Some(data) = rrd_cache::extract_data(
-            &cache,
+        if let Some(data) = cache.extract_data(
             "pve/some-remote/node/some-node",
             "cpu_current",
             RrdTimeframe::Hour,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f54ee72..31e36c3 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -123,8 +123,7 @@ fn get_entity(
 ) -> Option<(f64, TopEntity)> {
     let cache = rrd_cache::get_cache();
 
-    if let Ok(Some(values)) = rrd_cache::extract_data(
-        &cache,
+    if let Ok(Some(values)) = cache.extract_data(
         &name,
         metric,
         timeframe,
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 16/25] metric collection: record remote response time in metric database
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (14 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/25] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD Lukas Wagner
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This gives us the ability to retrieve max/avg response times for a given
time window.
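The timing pattern the patch uses can be sketched as follows; the `response_time_ms` helper is invented for illustration (the patch inlines the computation): take an `Instant` before the API request and convert the elapsed time to milliseconds for storage as an RRD gauge.

```rust
use std::time::{Duration, Instant};

// Convert elapsed wall-clock time since `start` to milliseconds, the
// unit stored in the RRD (hypothetical helper; the patch computes
// `duration.as_secs_f64() * 1000.` inline).
fn response_time_ms(start: Instant) -> f64 {
    start.elapsed().as_secs_f64() * 1000.0
}

fn main() {
    let start = Instant::now();
    // Stand-in for the remote API request.
    std::thread::sleep(Duration::from_millis(20));
    let ms = response_time_ms(start);
    // sleep() sleeps at least the requested duration.
    assert!(ms >= 20.0);
    println!("response time: {ms:.1} ms");
}
```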

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 38 +++++++++++--------
 server/src/metric_collection/rrd_task.rs      | 31 +++++++++++++++
 2 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f349056..27a9987 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -363,6 +363,7 @@ impl MetricCollectionTask {
         let (result_tx, result_rx) = oneshot::channel();
 
         let now = proxmox_time::epoch_i64();
+        let start = Instant::now();
 
         let res: Result<RrdStoreResult, Error> = async {
             match remote.ty {
@@ -376,11 +377,15 @@ impl MetricCollectionTask {
                         )
                         .await?;
 
+                    let duration = start.elapsed();
+
                     sender
                         .send(RrdStoreRequest::Pve {
                             remote: remote.id.clone(),
                             metrics,
                             channel: result_tx,
+                            response_time: duration.as_secs_f64() * 1000.,
+                            request_at: now,
                         })
                         .await?;
                 }
@@ -390,15 +395,19 @@ impl MetricCollectionTask {
                         .metrics(Some(true), Some(status.most_recent_datapoint))
                         .await?;
 
+                    let duration = start.elapsed();
+
                     sender
                         .send(RrdStoreRequest::Pbs {
                             remote: remote.id.clone(),
                             metrics,
                             channel: result_tx,
+                            response_time: duration.as_secs_f64() * 1000.,
+                            request_at: now,
                         })
                         .await?;
                 }
-            }
+            };
 
             result_rx.await.map_err(Error::from)
         }
@@ -582,20 +591,19 @@ pub(super) mod tests {
         while let Some(request) = rx.recv().await {
             number_of_requests += 1;
 
-            match a {
-                RrdStoreRequest::Pve {
-                    remote: _,
-                    metrics,
-                    channel,
-                } => {
-                    let most_recent_timestamp =
-                        metrics.data.iter().fold(0, |acc, e| acc.max(e.timestamp));
-
-                    let _ = channel.send(RrdStoreResult {
-                        most_recent_timestamp,
-                    });
-                }
-                _ => {}
+            if let RrdStoreRequest::Pve {
+                remote: _,
+                metrics,
+                channel,
+                ..
+            } = request
+            {
+                let most_recent_timestamp =
+                    metrics.data.iter().fold(0, |acc, e| acc.max(e.timestamp));
+
+                let _ = channel.send(RrdStoreResult {
+                    most_recent_timestamp,
+                });
             }
         }
 
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 89fe4d3..7f20d2d 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -20,6 +20,10 @@ pub(super) enum RrdStoreRequest {
         metrics: ClusterMetrics,
         /// Oneshot channel to return the [`RrdStoreResult`].
         channel: oneshot::Sender<RrdStoreResult>,
+        /// Response time in ms for the API request.
+        response_time: f64,
+        /// Timestamp at which the request was done.
+        request_at: i64,
     },
     /// Store PBS metrics.
     Pbs {
@@ -29,6 +33,10 @@ pub(super) enum RrdStoreRequest {
         metrics: Metrics,
         /// Oneshot channel to return the [`RrdStoreResult`].
         channel: oneshot::Sender<RrdStoreResult>,
+        /// Response time in ms for the API request.
+        response_time: f64,
+        /// Timestamp at which the request was done.
+        request_at: i64,
     },
 }
 
@@ -54,11 +62,14 @@ pub(super) async fn store_in_rrd_task(
                     remote,
                     metrics,
                     channel,
+                    response_time,
+                    request_at,
                 } => {
                     for data_point in metrics.data {
                         most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
                         store_metric_pve(&cache_clone, &remote, &data_point);
                     }
+                    store_response_time(&cache_clone, &remote, response_time, request_at);
 
                     channel
                 }
@@ -66,11 +77,14 @@ pub(super) async fn store_in_rrd_task(
                     remote,
                     metrics,
                     channel,
+                    response_time,
+                    request_at,
                 } => {
                     for data_point in metrics.data {
                         most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
                         store_metric_pbs(&cache_clone, &remote, &data_point);
                     }
+                    store_response_time(&cache_clone, &remote, response_time, request_at);
 
                     channel
                 }
@@ -130,6 +144,12 @@ fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricData
     );
 }
 
+fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64, timestamp: i64) {
+    let name = format!("local/metric-collection/remotes/{remote_name}/response-time");
+
+    cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
+}
+
 #[cfg(test)]
 mod tests {
     use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
@@ -192,6 +212,8 @@ mod tests {
             remote: "some-remote".into(),
             metrics,
             channel: tx_back,
+            response_time: 10.0,
+            request_at: now,
         };
 
         // Act
@@ -218,6 +240,15 @@ mod tests {
             assert!(data.data.iter().any(Option::is_some));
         }
 
+        if let Some(data) = cache.extract_data(
+            "local/metric-collection/remotes/some-remote",
+            "response-time",
+            RrdTimeframe::Hour,
+            RrdMode::Max,
+        )? {
+            assert!(data.data.iter().any(Option::is_some));
+        }
+
         Ok(())
     }
 }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (15 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 16/25] metric collection: record remote response time in metric database Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-13 11:53   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 18/25] metric collection: periodically clean removed remotes from statefile Lukas Wagner
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

For large setups, it might be useful to know how much time was needed to
collect metrics for *all* remotes together, e.g. for making sure that
the collection interval is not exceeded.
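One way such a metric could be consumed is sketched below with a hypothetical helper that is not part of this series: comparing the recorded total collection time (in ms, as stored in the RRD) against the configured collection interval to detect overruns.

```rust
// Hypothetical check, not part of the patch: does a full collection
// run take longer than the configured interval allows?
fn interval_exceeded(total_time_ms: f64, interval_secs: u64) -> bool {
    total_time_ms > interval_secs as f64 * 1000.0
}

fn main() {
    // 45 s of collection with a 60 s interval: fine.
    assert!(!interval_exceeded(45_000.0, 60));
    // 75 s of collection with a 60 s interval: overrun.
    assert!(interval_exceeded(75_000.0, 60));
    println!("ok");
}
```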

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 15 ++++++
 server/src/metric_collection/rrd_task.rs      | 50 +++++++++++++++----
 2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 27a9987..501df9b 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -22,6 +22,7 @@ use pdm_api_types::{
 };
 use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
 
+use crate::metric_collection::rrd_task::CollectionStats;
 use crate::{connection, task_utils};
 
 use super::{
@@ -97,6 +98,20 @@ impl MetricCollectionTask {
                     if let Some(remotes) = Self::load_remote_config() {
                         let to_fetch = remotes.order.as_slice();
-                        self.fetch_remotes(&remotes, to_fetch).await;
+
+                        let now = Instant::now();
+                        self.fetch_remotes(&remotes, &remotes.order).await;
+                        let elapsed = now.elapsed();
+
+                        if let Err(err) = self.metric_data_tx.send(
+                            RrdStoreRequest::CollectionStats {
+                                timestamp: proxmox_time::epoch_i64(),
+                                stats: CollectionStats {
+                                    total_time: elapsed.as_secs_f64() * 1000.
+                                }
+                            }).await {
+                            log::error!("could not send collection stats to rrd task: {err}");
+                        }
                     }
                 }
 
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 7f20d2d..6a4bacf 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -38,6 +38,13 @@ pub(super) enum RrdStoreRequest {
         /// Timestamp at which the request was done.
         request_at: i64,
     },
+    /// Store collection stats.
+    CollectionStats {
+        /// Timestamp at which the collection took place.
+        timestamp: i64,
+        /// Statistics.
+        stats: CollectionStats,
+    },
 }
 
 /// Result for a [`RrdStoreRequest`].
@@ -46,6 +53,12 @@ pub(super) struct RrdStoreResult {
     pub(super) most_recent_timestamp: i64,
 }
 
+/// Statistics for a (full) metric collection run.
+pub(super) struct CollectionStats {
+    /// Total time in ms
+    pub(super) total_time: f64,
+}
+
 /// Task which stores received metrics in the RRD. Metric data is fed into
 /// this task via a MPSC channel.
 pub(super) async fn store_in_rrd_task(
@@ -57,7 +70,8 @@ pub(super) async fn store_in_rrd_task(
         //// Involves some blocking file IO
         tokio::task::spawn_blocking(move || {
             let mut most_recent_timestamp = 0;
-            let channel = match msg {
+
+            match msg {
                 RrdStoreRequest::Pve {
                     remote,
                     metrics,
@@ -71,7 +85,13 @@ pub(super) async fn store_in_rrd_task(
                     }
                     store_response_time(&cache_clone, &remote, response_time, request_at);
 
-                    channel
+                    let result = RrdStoreResult {
+                        most_recent_timestamp,
+                    };
+
+                    if let Err(_err) = channel.send(result) {
+                        log::error!("could not send RrdStoreResult to metric collection task");
+                    };
                 }
                 RrdStoreRequest::Pbs {
                     remote,
@@ -86,14 +106,17 @@ pub(super) async fn store_in_rrd_task(
                     }
                     store_response_time(&cache_clone, &remote, response_time, request_at);
 
-                    channel
-                }
-            };
+                    let result = RrdStoreResult {
+                        most_recent_timestamp,
+                    };
 
-            if let Err(_err) = channel.send(RrdStoreResult {
-                most_recent_timestamp,
-            }) {
-                log::error!("could not send RrdStoreStoreResult to metric collection task");
+                    if let Err(_err) = channel.send(result) {
+                        log::error!("could not send RrdStoreResult to metric collection task");
+                    };
+                }
+                RrdStoreRequest::CollectionStats { timestamp, stats } => {
+                    store_stats(&cache_clone, &stats, timestamp)
+                }
             };
         })
         .await?;
@@ -150,6 +173,15 @@ fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64,
     cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
 }
 
+fn store_stats(cache: &RrdCache, stats: &CollectionStats, timestamp: i64) {
+    cache.update_value(
+        "local/metric-collection/total-time",
+        stats.total_time,
+        timestamp,
+        DataSourceType::Gauge,
+    );
+}
+
 #[cfg(test)]
 mod tests {
     use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel



* [pdm-devel] [PATCH proxmox-datacenter-manager 18/25] metric collection: periodically clean removed remotes from statefile
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (16 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings Lukas Wagner
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

Adding and later removing remotes can leave leftover data in the
statefile, so it makes sense to clean it up periodically.
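
The cleanup itself is a set intersection between statefile entries and
currently configured remotes. Below is a minimal sketch of the idea, with a
plain `HashMap`/`HashSet` standing in for the actual `MetricCollectionState`
and remote config types (all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet};

/// Drop state entries whose remote no longer exists in the remote config.
/// Mirrors the retain-based cleanup with a plain map as a stand-in.
fn cleanup_removed_remotes(state: &mut HashMap<String, i64>, configured: &HashSet<String>) {
    state.retain(|remote, _| configured.contains(remote));
}

fn main() {
    let mut state = HashMap::from([
        ("remote-1".to_string(), 1_739_275_500_i64),
        ("removed-remote".to_string(), 1_739_102_700_i64),
    ]);
    let configured = HashSet::from(["remote-1".to_string()]);

    cleanup_removed_remotes(&mut state, &configured);

    // Only the still-configured remote survives.
    assert!(state.contains_key("remote-1"));
    assert!(!state.contains_key("removed-remote"));
}
```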

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 .../src/metric_collection/collection_task.rs  | 10 ++++--
 server/src/metric_collection/state.rs         | 32 +++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 501df9b..7475368 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -96,11 +96,11 @@ impl MetricCollectionTask {
                     self.sleep_for_random_interval_offset().await;
 
                     if let Some(remotes) = Self::load_remote_config() {
-                        let to_fetch = remotes.order.as_slice();
-                        self.fetch_remotes(&remotes, to_fetch).await;
+                        self.cleanup_removed_remotes_from_state(&remotes);
 
                         let now = Instant::now();
-                        self.fetch_remotes(&remotes, &remotes.order).await;
+                        let to_fetch = remotes.order.as_slice();
+                        self.fetch_remotes(&remotes, to_fetch).await;
                         let elapsed = now.elapsed();
 
                         if let Err(err) = self.metric_data_tx.send(
@@ -196,6 +196,10 @@ impl MetricCollectionTask {
         tokio::time::sleep(Duration::from_millis(jitter)).await;
     }
 
+    fn cleanup_removed_remotes_from_state(&mut self, remotes: &SectionConfigData<Remote>) {
+        self.state.retain(|remote| remotes.get(remote).is_some());
+    }
+
     fn get_settings_or_default() -> CollectionSettings {
         // This function is a bit odd, but not sure if there is a much nicer
         // way to do it. We want to fall back to defaults if
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index a9f712d..2e90a4a 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -76,6 +76,11 @@ impl MetricCollectionState {
         Ok(())
     }
 
+    /// Retain all remotes for which the predicate holds.
+    pub fn retain(&mut self, check: impl Fn(&str) -> bool) {
+        self.state.remote_status.retain(|remote, _| check(remote));
+    }
+
     fn load_or_default(path: &Path) -> Result<State, Error> {
         let content = proxmox_sys::fs::file_read_optional_string(path)?;
 
@@ -117,4 +122,31 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn test_retain() -> Result<(), Error> {
+        let file = NamedTempFile::new(get_create_options())?;
+        let options = get_create_options();
+        let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+        state.set_status(
+            "remote-1".into(),
+            RemoteStatus {
+                ..Default::default()
+            },
+        );
+        state.set_status(
+            "remote-2".into(),
+            RemoteStatus {
+                ..Default::default()
+            },
+        );
+
+        state.retain(|remote| remote == "remote-1");
+
+        assert!(state.get_status("remote-1").is_some());
+        assert!(state.get_status("remote-2").is_none());
+
+        Ok(())
+    }
 }
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (17 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 18/25] metric collection: periodically clean removed remotes from statefile Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-13 12:09   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 20/25] api: add endpoint to trigger metric collection Lukas Wagner
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This one lives under /config/metric-collection.
At the moment we do not offer full CRUD, only a hard-coded collection
settings instance at /config/metric-collection/default, which can be
retrieved via GET and updated via PUT.

Later, we might extend this to full CRUD, e.g. to have different
settings for different 'poll-groups' consisting of a subset of remotes.
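
The PUT handler follows the usual updater pattern: `Some` fields override,
entries in the `delete` array reset properties to their defaults, and the
min/max range invariant is only enforced when one of the affected values was
actually changed, so updating unrelated settings cannot fail on a pre-existing
invalid config. A condensed, self-contained sketch of that pattern with a
reduced field set (all names hypothetical, not the actual
`CollectionSettingsUpdater` API):

```rust
#[derive(Default, Debug)]
struct Settings {
    min_interval_offset: Option<u64>,
    max_interval_offset: Option<u64>,
}

impl Settings {
    fn min_or_default(&self) -> u64 { self.min_interval_offset.unwrap_or(0) }
    fn max_or_default(&self) -> u64 { self.max_interval_offset.unwrap_or(300) }
}

/// Apply an updater: `delete_min` resets a field, `Some` values override.
/// The min <= max invariant is only checked if one of the two was touched.
fn update(
    entry: &mut Settings,
    new_min: Option<u64>,
    new_max: Option<u64>,
    delete_min: bool,
) -> Result<(), String> {
    let mut changed = false;
    if delete_min {
        entry.min_interval_offset = None;
        changed = true;
    }
    if let Some(v) = new_min {
        entry.min_interval_offset = Some(v);
        changed = true;
    }
    if let Some(v) = new_max {
        entry.max_interval_offset = Some(v);
        changed = true;
    }
    if changed && entry.min_or_default() > entry.max_or_default() {
        return Err("min-interval-offset must not be larger than max-interval-offset".into());
    }
    Ok(())
}

fn main() {
    let mut s = Settings::default();
    // Valid range: accepted.
    assert!(update(&mut s, Some(10), Some(60), false).is_ok());
    // New minimum exceeds the stored maximum: rejected.
    assert!(update(&mut s, Some(120), None, false).is_err());
}
```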

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/config/metric_collection.rs | 166 +++++++++++++++++++++
 server/src/api/config/mod.rs               |   2 +
 2 files changed, 168 insertions(+)
 create mode 100644 server/src/api/config/metric_collection.rs

diff --git a/server/src/api/config/metric_collection.rs b/server/src/api/config/metric_collection.rs
new file mode 100644
index 0000000..1146101
--- /dev/null
+++ b/server/src/api/config/metric_collection.rs
@@ -0,0 +1,166 @@
+use anyhow::{bail, Error};
+
+use proxmox_config_digest::ConfigDigest;
+use proxmox_router::{list_subdirs_api_method, Router, RpcEnvironment, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+use pdm_api_types::{
+    CollectionSettings, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
+};
+use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([("default", &DEFAULT_ROUTER),]);
+
+pub const ROUTER: Router = Router::new()
+    .get(&list_subdirs_api_method!(SUBDIRS))
+    .subdirs(SUBDIRS);
+
+const DEFAULT_ROUTER: Router = Router::new()
+    .get(&API_METHOD_GET_SETTINGS)
+    .put(&API_METHOD_UPDATE_SETTINGS);
+
+#[api(
+    protected: false,
+)]
+/// Get metric collection settings.
+pub fn get_settings(rpcenv: &mut dyn RpcEnvironment) -> Result<CollectionSettings, Error> {
+    let (config, digest) = pdm_config::metric_collection::config()?;
+    rpcenv["digest"] = digest.to_hex().into();
+
+    if config.sections.contains_key("default") {
+        let settings = config.lookup(COLLECTION_SETTINGS_TYPE, "default")?;
+        Ok(settings)
+    } else {
+        Ok(CollectionSettings {
+            id: "default".into(),
+            ..Default::default()
+        })
+    }
+}
+
+#[api(
+    input: {
+        properties: {
+            updater: {
+                type: CollectionSettingsUpdater,
+            },
+            delete: {
+                type: Array,
+                items: {
+                    type: DeletableCollectionSettingsProperty,
+                },
+                description: "Array of properties that should be deleted.",
+                optional: true,
+            },
+            digest: {
+                type: ConfigDigest,
+                optional: true,
+            },
+        },
+    },
+)]
+/// Update metric collection settings.
+pub fn update_settings(
+    updater: CollectionSettingsUpdater,
+    delete: Option<Vec<DeletableCollectionSettingsProperty>>,
+    digest: Option<ConfigDigest>,
+) -> Result<(), Error> {
+    let (mut config, existing_digest) = pdm_config::metric_collection::config()?;
+    if digest.is_some() {
+        existing_digest.detect_modification(digest.as_ref())?;
+    }
+
+    let mut entry = if config.sections.contains_key("default") {
+        config.lookup(COLLECTION_SETTINGS_TYPE, "default")?
+    } else {
+        CollectionSettings {
+            id: "default".into(),
+            ..Default::default()
+        }
+    };
+
+    let mut interval_offset_changed = false;
+    let mut connection_delay_changed = false;
+
+    if let Some(delete) = delete {
+        for property in delete {
+            use DeletableCollectionSettingsProperty::*;
+
+            match property {
+                MaxIntervalOffset => {
+                    entry.max_interval_offset = None;
+                    interval_offset_changed = true;
+                }
+
+                MinIntervalOffset => {
+                    entry.min_interval_offset = None;
+                    interval_offset_changed = true;
+                }
+                MaxConnectionDelay => {
+                    entry.max_connection_delay = None;
+                    connection_delay_changed = true;
+                }
+                MinConnectionDelay => {
+                    entry.min_connection_delay = None;
+                    connection_delay_changed = true;
+                }
+                MaxConcurrentConnections => entry.max_concurrent_connections = None,
+                CollectionInterval => entry.collection_interval = None,
+            }
+        }
+    }
+
+    let CollectionSettingsUpdater {
+        collection_interval,
+        max_concurrent_connections,
+        max_interval_offset,
+        min_interval_offset,
+        max_connection_delay,
+        min_connection_delay,
+    } = updater;
+
+    if let Some(value) = collection_interval {
+        entry.collection_interval = Some(value);
+    }
+    if let Some(value) = max_concurrent_connections {
+        entry.max_concurrent_connections = Some(value);
+    }
+    if let Some(value) = max_interval_offset {
+        entry.max_interval_offset = Some(value);
+        interval_offset_changed = true;
+    }
+    if let Some(value) = min_interval_offset {
+        entry.min_interval_offset = Some(value);
+        interval_offset_changed = true;
+    }
+    if let Some(value) = max_connection_delay {
+        entry.max_connection_delay = Some(value);
+        connection_delay_changed = true;
+    }
+    if let Some(value) = min_connection_delay {
+        entry.min_connection_delay = Some(value);
+        connection_delay_changed = true;
+    }
+
+    // Ensure that the minimum value is <= the maximum value - but only if we have
+    // actually changed the value. We do not want to fail if the config already contains
+    // something invalid and we are changing *other* settings.
+    if entry.min_interval_offset_or_default() > entry.max_interval_offset_or_default()
+        && interval_offset_changed
+    {
+        bail!("min-interval-offset must not be larger than max-interval-offset");
+    }
+
+    if (entry.min_connection_delay_or_default() > entry.max_connection_delay_or_default())
+        && connection_delay_changed
+    {
+        bail!("min-connection-delay must not be larger than max-connection-delay");
+    }
+
+    config.set_data("default", COLLECTION_SETTINGS_TYPE, entry)?;
+    pdm_config::metric_collection::save_config(&config)?;
+
+    Ok(())
+}
diff --git a/server/src/api/config/mod.rs b/server/src/api/config/mod.rs
index 7b58c75..15dc42c 100644
--- a/server/src/api/config/mod.rs
+++ b/server/src/api/config/mod.rs
@@ -5,6 +5,7 @@ use proxmox_sortable_macro::sortable;
 pub mod access;
 pub mod acme;
 pub mod certificate;
+pub mod metric_collection;
 pub mod notes;
 
 #[sortable]
@@ -12,6 +13,7 @@ const SUBDIRS: SubdirMap = &sorted!([
     ("access", &access::ROUTER),
     ("acme", &acme::ROUTER),
     ("certificate", &certificate::ROUTER),
+    ("metric-collection", &metric_collection::ROUTER),
     ("notes", &notes::ROUTER),
 ]);
 
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 20/25] api: add endpoint to trigger metric collection
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (18 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 21/25] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This one lives under /metric-collection/trigger.
A POST triggers metric collection for the remote given in the 'remote'
parameter, or for all remotes if the parameter is omitted.
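
From the caller's perspective the semantics are just "one remote if given,
otherwise all". A tiny synchronous sketch of that dispatch (hypothetical
names, standing in for the async trigger functions):

```rust
/// Resolve which remotes a trigger request applies to.
fn resolve_targets<'a>(remote: Option<&'a str>, all: &[&'a str]) -> Vec<&'a str> {
    match remote {
        // Explicit remote: only that one is fetched.
        Some(r) => vec![r],
        // No remote given: collection runs for every configured remote.
        None => all.to_vec(),
    }
}

fn main() {
    let all = ["pve-1", "pbs-1"];
    assert_eq!(resolve_targets(Some("pve-1"), &all), vec!["pve-1"]);
    assert_eq!(resolve_targets(None, &all), vec!["pve-1", "pbs-1"]);
}
```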

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/metric_collection.rs | 35 +++++++++++++++++++++++++++++
 server/src/api/mod.rs               |  2 ++
 2 files changed, 37 insertions(+)
 create mode 100644 server/src/api/metric_collection.rs

diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
new file mode 100644
index 0000000..f0a8ea8
--- /dev/null
+++ b/server/src/api/metric_collection.rs
@@ -0,0 +1,35 @@
+use anyhow::Error;
+
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use proxmox_router::{Router, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([(
+    "trigger",
+    &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+),]);
+
+#[api(
+    input: {
+        properties: {
+            remote: {
+                schema: REMOTE_ID_SCHEMA,
+                optional: true,
+            },
+        },
+    },
+)]
+/// Trigger metric collection for a given remote.
+pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+    if let Some(remote) = remote {
+        crate::metric_collection::trigger_metric_collection_for_remote(&remote).await?;
+    } else {
+        crate::metric_collection::trigger_metric_collection().await?;
+    }
+
+    Ok(())
+}
diff --git a/server/src/api/mod.rs b/server/src/api/mod.rs
index 6c4831b..ff875fc 100644
--- a/server/src/api/mod.rs
+++ b/server/src/api/mod.rs
@@ -9,6 +9,7 @@ use proxmox_sortable_macro::sortable;
 
 pub mod access;
 pub mod config;
+pub mod metric_collection;
 pub mod nodes;
 pub mod pbs;
 pub mod pve;
@@ -21,6 +22,7 @@ mod rrd_common;
 const SUBDIRS: SubdirMap = &sorted!([
     ("access", &access::ROUTER),
     ("config", &config::ROUTER),
+    ("metric-collection", &metric_collection::ROUTER),
     ("ping", &Router::new().get(&API_METHOD_PING)),
     ("pve", &pve::ROUTER),
     ("pbs", &pbs::ROUTER),
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 21/25] api: remotes: trigger immediate metric collection for newly added nodes
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (19 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 20/25] api: add endpoint to trigger metric collection Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 22/25] api: add api for querying metric collection RRD data Lukas Wagner
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This makes sure that metric graphs are available right after adding a
new remote.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 server/src/api/remotes.rs | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 8ff1ada..60c451a 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -18,6 +18,7 @@ use proxmox_time::{epoch_i64, epoch_to_rfc2822};
 use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
 use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
 
+use crate::metric_collection;
 use crate::{connection, pbs_client};
 
 use super::pve;
@@ -171,10 +172,15 @@ pub async fn add_remote(mut entry: Remote, create_token: Option<String>) -> Resu
         entry.token = token;
     }
 
+    let name = entry.id.clone();
     remotes.insert(entry.id.to_owned(), entry);
 
     pdm_config::remotes::save_config(&remotes)?;
 
+    if let Err(e) = metric_collection::trigger_metric_collection_for_remote(&name).await {
+        log::error!("could not trigger metric collection after adding remote: {e}");
+    }
+
     Ok(())
 }
 
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 22/25] api: add api for querying metric collection RRD data
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (20 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 21/25] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 23/25] api: metric-collection: add status endpoint Lukas Wagner
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This commit adds two new API endpoints:

 - remotes/{id}/metric-collection-rrddata
 - metric-collection/rrddata

The first one returns graphable datapoints for the API response time
when fetching metrics, the second one returns datapoints for the total
time needed to collect *all* remotes.
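
Both endpoints are built on the `DataPoint` abstraction in `rrd_common`: a
datapoint type declares which RRD fields it consumes and how each field value
maps onto a struct member. A self-contained sketch of that pattern follows;
`datapoints_from_series` is a simplified stand-in for
`create_datapoints_from_rrd`, not the actual implementation:

```rust
use std::collections::HashMap;

/// Simplified shape of the DataPoint trait.
trait DataPoint {
    fn new(time: u64) -> Self;
    fn fields() -> &'static [&'static str];
    fn set_field(&mut self, name: &str, value: f64);
}

#[derive(Default)]
struct FullCollectionDatapoint {
    time: u64,
    total_time: Option<f64>,
}

impl DataPoint for FullCollectionDatapoint {
    fn new(time: u64) -> Self {
        Self { time, ..Default::default() }
    }
    fn fields() -> &'static [&'static str] {
        &["total-time"]
    }
    fn set_field(&mut self, name: &str, value: f64) {
        if name == "total-time" {
            self.total_time = Some(value);
        }
    }
}

/// Zip per-field time series into typed datapoints, one per timestamp.
fn datapoints_from_series<T: DataPoint>(series: &HashMap<&str, Vec<(u64, f64)>>) -> Vec<T> {
    let mut out: Vec<T> = Vec::new();
    for name in T::fields() {
        if let Some(values) = series.get(name) {
            for (i, (time, value)) in values.iter().enumerate() {
                if out.len() <= i {
                    out.push(T::new(*time));
                }
                out[i].set_field(name, *value);
            }
        }
    }
    out
}

fn main() {
    let mut series = HashMap::new();
    series.insert("total-time", vec![(100, 12.5), (160, 13.0)]);
    let points: Vec<FullCollectionDatapoint> = datapoints_from_series(&series);
    assert_eq!(points.len(), 2);
    assert_eq!(points[0].total_time, Some(12.5));
    assert_eq!(points[1].time, 160);
}
```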

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-api-types/src/rrddata.rs    | 26 +++++++++++++
 server/src/api/metric_collection.rs | 60 ++++++++++++++++++++++++++---
 server/src/api/remotes.rs           | 53 +++++++++++++++++++++++++
 3 files changed, 134 insertions(+), 5 deletions(-)

diff --git a/lib/pdm-api-types/src/rrddata.rs b/lib/pdm-api-types/src/rrddata.rs
index a973977..7559243 100644
--- a/lib/pdm-api-types/src/rrddata.rs
+++ b/lib/pdm-api-types/src/rrddata.rs
@@ -216,3 +216,29 @@ pub struct PbsDatastoreDataPoint {
     #[serde(skip_serializing_if = "Option::is_none")]
     pub disk_write: Option<f64>,
 }
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for statistics about the metric collection loop.
+pub struct FullCollectionDatapoint {
+    /// Timestamp (UNIX epoch)
+    pub time: u64,
+
+    /// Total time in milliseconds needed for full metric collection run.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub total_time: Option<f64>,
+}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for metric collection from a single remote.
+pub struct RemoteCollectionDatapoint {
+    /// Timestamp (UNIX epoch)
+    pub time: u64,
+
+    /// API response time in milliseconds when requesting the metrics from the remote.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub response_time: Option<f64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index f0a8ea8..8b6748d 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -1,17 +1,27 @@
 use anyhow::Error;
 
-use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
 use proxmox_router::{Router, SubdirMap};
+use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
 use proxmox_schema::api;
 use proxmox_sortable_macro::sortable;
 
+use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+
+use super::rrd_common::{self, DataPoint};
+
 pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
 
 #[sortable]
-const SUBDIRS: SubdirMap = &sorted!([(
-    "trigger",
-    &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
-),]);
+const SUBDIRS: SubdirMap = &sorted!([
+    (
+        "trigger",
+        &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+    ),
+    (
+        "rrddata",
+        &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
+    ),
+]);
 
 #[api(
     input: {
@@ -33,3 +43,43 @@ pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Err
 
     Ok(())
 }
+
+impl DataPoint for FullCollectionDatapoint {
+    fn new(time: u64) -> Self {
+        Self {
+            time,
+            ..Default::default()
+        }
+    }
+
+    fn fields() -> &'static [&'static str] {
+        &["total-time"]
+    }
+
+    fn set_field(&mut self, name: &str, value: f64) {
+        if name == "total-time" {
+            self.total_time = Some(value);
+        }
+    }
+}
+
+#[api(
+    input: {
+        properties: {
+            timeframe: {
+                type: RrdTimeframe,
+            },
+            cf: {
+                type: RrdMode,
+            },
+        },
+    },
+)]
+/// Read metric collection RRD data.
+fn get_metric_collection_rrd_data(
+    timeframe: RrdTimeframe,
+    cf: RrdMode,
+) -> Result<Vec<FullCollectionDatapoint>, Error> {
+    let base = "local/metric-collection";
+    rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
+}
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 60c451a..17140d8 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -9,6 +9,8 @@ use proxmox_access_control::CachedUserInfo;
 use proxmox_router::{
     http_bail, http_err, list_subdirs_api_method, Permission, Router, RpcEnvironment, SubdirMap,
 };
+use proxmox_rrd_api_types::RrdMode;
+use proxmox_rrd_api_types::RrdTimeframe;
 use proxmox_schema::api;
 use proxmox_schema::Schema;
 use proxmox_section_config::typed::SectionConfigData;
@@ -16,12 +18,15 @@ use proxmox_sortable_macro::sortable;
 use proxmox_time::{epoch_i64, epoch_to_rfc2822};
 
 use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
+use pdm_api_types::rrddata::RemoteCollectionDatapoint;
 use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
 
 use crate::metric_collection;
 use crate::{connection, pbs_client};
 
 use super::pve;
+use super::rrd_common;
+use super::rrd_common::DataPoint;
 
 pub const ROUTER: Router = Router::new()
     .get(&API_METHOD_LIST_REMOTES)
@@ -38,6 +43,10 @@ const ITEM_ROUTER: Router = Router::new()
 const SUBDIRS: SubdirMap = &sorted!([
     ("config", &Router::new().get(&API_METHOD_REMOTE_CONFIG)),
     ("version", &Router::new().get(&API_METHOD_VERSION)),
+    (
+        "metric-collection-rrddata",
+        &Router::new().get(&API_METHOD_GET_PER_REMOTE_METRIC_COLLECTION_RRD_DATA)
+    ),
 ]);
 
 pub fn get_remote<'a>(
@@ -331,3 +340,47 @@ pub fn remote_config(id: String) -> Result<Remote, Error> {
     remote.token = String::new(); // mask token in response
     Ok(remote.clone())
 }
+
+impl DataPoint for RemoteCollectionDatapoint {
+    fn new(time: u64) -> Self {
+        Self {
+            time,
+            ..Default::default()
+        }
+    }
+
+    fn fields() -> &'static [&'static str] {
+        &["response-time"]
+    }
+
+    fn set_field(&mut self, name: &str, value: f64) {
+        if name == "response-time" {
+            self.response_time = Some(value);
+        }
+    }
+}
+
+#[api(
+    input: {
+        properties: {
+            id: {
+                schema: REMOTE_ID_SCHEMA,
+            },
+            timeframe: {
+                type: RrdTimeframe,
+            },
+            cf: {
+                type: RrdMode,
+            },
+        },
+    },
+)]
+/// Read metric collection RRD data.
+fn get_per_remote_metric_collection_rrd_data(
+    id: String,
+    timeframe: RrdTimeframe,
+    cf: RrdMode,
+) -> Result<Vec<RemoteCollectionDatapoint>, Error> {
+    let base = format!("local/metric-collection/remotes/{id}");
+    rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+}
-- 
2.39.5




* [pdm-devel] [PATCH proxmox-datacenter-manager 23/25] api: metric-collection: add status endpoint
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (21 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 22/25] api: add api for querying metric collection RRD data Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods Lukas Wagner
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This adds a new API endpoint at /metric-collection/status, returning the
status of the last metric collection attempt from each remote.
For now, this contains the timestamp of the last successful collection
(last-collection) and any error that occurred during the last attempt
(error).
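
`get_status` boils down to a join between the ordered remote list from the
remote config and the persisted per-remote state; remotes that were never
collected have no state entry and are skipped. A self-contained sketch with
plain maps standing in for the config and statefile types:

```rust
use std::collections::HashMap;

#[derive(Clone, Default)]
struct RemoteStatus {
    error: Option<String>,
    last_collection: Option<i64>,
}

struct MetricCollectionStatus {
    remote: String,
    error: Option<String>,
    last_collection: Option<i64>,
}

/// Join the configured remote order with the persisted per-remote state;
/// remotes without any recorded state are skipped.
fn status(order: &[&str], state: &HashMap<String, RemoteStatus>) -> Vec<MetricCollectionStatus> {
    order
        .iter()
        .filter_map(|remote| {
            state.get(*remote).map(|s| MetricCollectionStatus {
                remote: remote.to_string(),
                error: s.error.clone(),
                last_collection: s.last_collection,
            })
        })
        .collect()
}

fn main() {
    let mut state = HashMap::new();
    state.insert(
        "pve-1".to_string(),
        RemoteStatus { error: None, last_collection: Some(1_739_275_500) },
    );
    // "new-remote" has no state yet and is not reported.
    let result = status(&["pve-1", "new-remote"], &state);
    assert_eq!(result.len(), 1);
    assert_eq!(result[0].remote, "pve-1");
}
```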

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-api-types/src/metric_collection.rs | 15 +++++++++++++++
 server/src/api/metric_collection.rs        | 16 +++++++++++++++-
 server/src/metric_collection/mod.rs        | 22 ++++++++++++++++++++++
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
index 99dac9b..954e1bd 100644
--- a/lib/pdm-api-types/src/metric_collection.rs
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -178,3 +178,18 @@ impl CollectionSettings {
             .unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
     }
 }
+
+#[api]
+#[derive(Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Per-remote collection status.
+pub struct MetricCollectionStatus {
+    /// The remote's name.
+    pub remote: String,
+    /// Any error that occurred during the last collection attempt.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub error: Option<String>,
+    /// Timestamp of last successful collection.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub last_collection: Option<i64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 8b6748d..f89e6a0 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -5,7 +5,11 @@ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
 use proxmox_schema::api;
 use proxmox_sortable_macro::sortable;
 
-use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+use pdm_api_types::{
+    remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint, MetricCollectionStatus,
+};
+
+use crate::metric_collection;
 
 use super::rrd_common::{self, DataPoint};
 
@@ -21,6 +25,10 @@ const SUBDIRS: SubdirMap = &sorted!([
         "rrddata",
         &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
     ),
+    (
+        "status",
+        &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
+    ),
 ]);
 
 #[api(
@@ -83,3 +91,9 @@ fn get_metric_collection_rrd_data(
     let base = "local/metric-collection";
     rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
 }
+
+#[api]
+/// Get the metric collection status for each remote.
+fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+    metric_collection::get_status()
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 6551f2b..1069c40 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,8 @@ use tokio::sync::mpsc::{self, Sender};
 
 use proxmox_sys::fs::CreateOptions;
 
+use pdm_api_types::MetricCollectionStatus;
+
 mod collection_task;
 pub mod rrd_cache;
 mod rrd_task;
@@ -88,3 +90,23 @@ pub async fn trigger_metric_collection() -> Result<(), Error> {
 
     Ok(())
 }
+
+/// Get each remote's metric collection status.
+pub fn get_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+    let (remotes, _) = pdm_config::remotes::config()?;
+    let state = collection_task::load_state()?;
+
+    let mut result = Vec::new();
+
+    for remote in remotes.order {
+        if let Some(status) = state.get_status(&remote) {
+            result.push(MetricCollectionStatus {
+                remote,
+                error: status.error.clone(),
+                last_collection: status.last_collection,
+            })
+        }
+    }
+
+    Ok(result)
+}
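The `get_status()` function above iterates remotes in configured order and skips any remote without recorded state (i.e. one that has never been collected). A minimal, std-only sketch of that filtering (the `RemoteStatus`/`collect_status` names are hypothetical stand-ins for the real state types):

```rust
use std::collections::HashMap;

// Hypothetical per-remote state entry, mirroring the fields used in get_status().
#[derive(Clone)]
struct RemoteStatus {
    error: Option<String>,
    last_collection: Option<i64>,
}

/// Walk remotes in configured order, keeping only those with recorded state.
fn collect_status(
    order: &[String],
    state: &HashMap<String, RemoteStatus>,
) -> Vec<(String, Option<String>, Option<i64>)> {
    let mut result = Vec::new();
    for remote in order {
        if let Some(status) = state.get(remote) {
            result.push((remote.clone(), status.error.clone(), status.last_collection));
        }
    }
    result
}

fn main() {
    let mut state = HashMap::new();
    state.insert(
        "pve-a".to_string(),
        RemoteStatus { error: None, last_collection: Some(1_739_275_500) },
    );
    // "pbs-b" has no state yet (never collected), so it is skipped.
    let order = vec!["pve-a".to_string(), "pbs-b".to_string()];
    let result = collect_status(&order, &state);
    assert_eq!(result.len(), 1);
    assert_eq!(result[0].0, "pve-a");
    println!("{} remote(s) with status", result.len());
}
```

Note that the ordering comes from the remotes config (`remotes.order`), so the API output is stable across calls.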
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (22 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 23/25] api: metric-collection: add status endpoint Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-13 12:10   ` Wolfgang Bumiller
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
  2025-02-14 13:08 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This adds bindings for collection settings, trigger, and status APIs.

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 lib/pdm-client/src/lib.rs | 87 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
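The rrddata bindings added below assemble query strings incrementally via an `add_query_arg` helper, toggling the separator from `?` to `&` after the first argument. A simplified, std-only sketch of that pattern (assumption: the real helper also serializes and percent-encodes values, which is omitted here):

```rust
// Simplified, hypothetical version of the add_query_arg helper:
// the first appended argument uses '?', subsequent ones '&'.
fn add_query_arg(path: &mut String, sep: &mut char, name: &str, value: &Option<&str>) {
    if let Some(value) = value {
        path.push(*sep);
        *sep = '&';
        path.push_str(name);
        path.push('=');
        path.push_str(value);
    }
}

fn main() {
    let mut path = "/api2/extjs/metric-collection/rrddata".to_string();
    let mut sep = '?';
    add_query_arg(&mut path, &mut sep, "cf", &Some("average"));
    add_query_arg(&mut path, &mut sep, "timeframe", &Some("hour"));
    assert_eq!(
        path,
        "/api2/extjs/metric-collection/rrddata?cf=average&timeframe=hour"
    );
    println!("{path}");
}
```

Passing `None` for a value simply skips the argument, which is why optional parameters compose cleanly.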

diff --git a/lib/pdm-client/src/lib.rs b/lib/pdm-client/src/lib.rs
index a41b82c..ae41e9a 100644
--- a/lib/pdm-client/src/lib.rs
+++ b/lib/pdm-client/src/lib.rs
@@ -320,6 +320,93 @@ impl<T: HttpApiClient> PdmClient<T> {
             .ok_or_else(|| Error::BadApi("api returned no webauthn entry id".to_string(), None))
     }
 
+    /// Get global metric collection settings.
+    pub async fn get_metric_collection_settings(
+        &self,
+    ) -> Result<pdm_api_types::CollectionSettings, Error> {
+        let path = "/api2/extjs/config/metric-collection/default";
+        Ok(self.0.get(path).await?.expect_json()?.data)
+    }
+
+    /// Update global metric collection settings.
+    pub async fn update_metric_collection_settings(
+        &self,
+        updater: pdm_api_types::CollectionSettingsUpdater,
+        delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
+    ) -> Result<(), Error> {
+        let path = "/api2/extjs/config/metric-collection/default";
+
+        #[derive(Serialize)]
+        struct UpdateSettings<'a> {
+            updater: &'a pdm_api_types::CollectionSettingsUpdater,
+            #[serde(skip_serializing_if = "Option::is_none")]
+            delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
+        }
+
+        let params = UpdateSettings {
+            updater: &updater,
+            delete,
+        };
+        self.0.put(path, &params).await?.nodata()?;
+        Ok(())
+    }
+
+    /// Trigger metric collection for a single remote, or for all remotes if none is provided.
+    pub async fn trigger_metric_collection(
+        &self,
+        remote: Option<&str>,
+    ) -> Result<(), proxmox_client::Error> {
+        let path = "/api2/extjs/metric-collection/trigger";
+
+        #[derive(Serialize)]
+        struct TriggerParams<'a> {
+            #[serde(skip_serializing_if = "Option::is_none")]
+            remote: Option<&'a str>,
+        }
+
+        self.0
+            .post(path, &TriggerParams { remote })
+            .await?
+            .expect_json::<()>()?;
+
+        Ok(())
+    }
+
+    /// Get global metric collection status.
+    pub async fn get_metric_collection_status(
+        &self,
+    ) -> Result<Vec<pdm_api_types::MetricCollectionStatus>, Error> {
+        let path = "/api2/extjs/metric-collection/status";
+        Ok(self.0.get(path).await?.expect_json()?.data)
+    }
+
+    /// Get global metric collection RRD data.
+    pub async fn get_metric_collection_rrddata(
+        &self,
+        mode: RrdMode,
+        timeframe: RrdTimeframe,
+    ) -> Result<pdm_api_types::rrddata::FullCollectionDatapoint, Error> {
+        let mut path = "/api2/extjs/metric-collection/rrddata".to_string();
+        let mut sep = '?';
+        add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
+        add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
+        Ok(self.0.get(&path).await?.expect_json()?.data)
+    }
+
+    /// Get per-remote metric collection RRD data.
+    pub async fn get_per_remote_metric_collection_rrddata(
+        &self,
+        remote: &str,
+        mode: RrdMode,
+        timeframe: RrdTimeframe,
+    ) -> Result<pdm_api_types::rrddata::RemoteCollectionDatapoint, Error> {
+        let mut path = format!("/api2/extjs/remotes/{remote}/metric-collection-rrddata");
+        let mut sep = '?';
+        add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
+        add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
+        Ok(self.0.get(&path).await?.expect_json()?.data)
+    }
+
     pub async fn pve_list_nodes(
         &self,
         remote: &str,
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (23 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods Lukas Wagner
@ 2025-02-11 12:05 ` Lukas Wagner
  2025-02-13 12:14   ` Wolfgang Bumiller
  2025-02-14 13:08 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
  25 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:05 UTC (permalink / raw)
  To: pdm-devel

This adds new commands to the proxmox-datacenter-client CLI tool, namely to
 - show and update metric collection settings
 - trigger metric collection
 - show the status of the last metric collection

Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
 cli/client/Cargo.toml               |   1 +
 cli/client/src/main.rs              |   2 +
 cli/client/src/metric_collection.rs | 164 ++++++++++++++++++++++++++++
 3 files changed, 167 insertions(+)
 create mode 100644 cli/client/src/metric_collection.rs

diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
index 96016cd..5c96b28 100644
--- a/cli/client/Cargo.toml
+++ b/cli/client/Cargo.toml
@@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
 proxmox-schema = { workspace = true, features = [ "api-macro" ] }
 proxmox-section-config.workspace = true
 proxmox-sys.workspace = true
+proxmox-time.workspace = true
 proxmox-tfa = { workspace = true, features = [ "types" ] }
 
 # for the pve API types:
diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
index c913e97..55877cf 100644
--- a/cli/client/src/main.rs
+++ b/cli/client/src/main.rs
@@ -17,6 +17,7 @@ pub mod env;
 
 pub mod acl;
 pub mod config;
+pub mod metric_collection;
 pub mod pbs;
 pub mod pve;
 pub mod remotes;
@@ -93,6 +94,7 @@ fn main_do() -> Result<(), Error> {
         )
         .insert("acl", acl::cli())
         .insert("login", CliCommand::new(&API_METHOD_LOGIN))
+        .insert("metric-collection", metric_collection::cli())
         .insert("pbs", pbs::cli())
         .insert("pve", pve::cli())
         .insert("remote", remotes::cli())
diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
new file mode 100644
index 0000000..8ca861a
--- /dev/null
+++ b/cli/client/src/metric_collection.rs
@@ -0,0 +1,164 @@
+use anyhow::Error;
+use pdm_api_types::{
+    remotes::REMOTE_ID_SCHEMA, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
+};
+use proxmox_router::cli::{
+    format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
+};
+use proxmox_schema::api;
+
+use crate::{client, env};
+
+pub fn cli() -> CommandLineInterface {
+    CliCommandMap::new()
+        .insert(
+            "show-settings",
+            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
+        )
+        .insert(
+            "update-settings",
+            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
+        )
+        .insert(
+            "trigger",
+            CliCommand::new(&API_METHOD_TRIGGER_METRIC_COLLECTION),
+        )
+        .insert(
+            "status",
+            CliCommand::new(&API_METHOD_METRIC_COLLECTION_STATUS),
+        )
+        .into()
+}
+
+#[api]
+/// Show metric collection settings.
+async fn show_collection_settings() -> Result<(), Error> {
+    let client = client()?;
+    let settings = client.get_metric_collection_settings().await?;
+
+    fn default_if_none<T>(value: Option<T>) -> &'static str {
+        if value.is_none() {
+            " (default)"
+        } else {
+            ""
+        }
+    }
+
+    let output_format = env().format_args.output_format;
+    if output_format == OutputFormat::Text {
+        println!("Metric collection settings:");
+        println!(
+            "    collection-interval: {}s {}",
+            settings.collection_interval_or_default(),
+            default_if_none(settings.collection_interval),
+        );
+        println!(
+            "    max-interval-offset: {}s {}",
+            settings.max_interval_offset_or_default(),
+            default_if_none(settings.max_interval_offset)
+        );
+        println!(
+            "    min-interval-offset: {}s {}",
+            settings.min_interval_offset_or_default(),
+            default_if_none(settings.min_interval_offset)
+        );
+        println!(
+            "    max-connection-delay: {}ms {}",
+            settings.max_connection_delay_or_default(),
+            default_if_none(settings.max_connection_delay)
+        );
+        println!(
+            "    min-connection-delay: {}ms {}",
+            settings.min_connection_delay_or_default(),
+            default_if_none(settings.min_connection_delay)
+        );
+        println!(
+            "    max-concurrent-connections: {} {}",
+            settings.max_concurrent_connections_or_default(),
+            default_if_none(settings.max_concurrent_connections),
+        );
+    } else {
+        format_and_print_result(&settings, &output_format.to_string());
+    }
+    Ok(())
+}
+
+#[api(
+    input: {
+        properties: {
+            updater: {
+                type: CollectionSettingsUpdater,
+                flatten: true,
+            },
+            delete: {
+                description: "Clear/reset collection settings properties.",
+                optional: true,
+                type: Array,
+                items: {
+                    type: DeletableCollectionSettingsProperty,
+                },
+            },
+        }
+    }
+)]
+/// Change metric collection settings.
+async fn update_collection_settings(
+    updater: CollectionSettingsUpdater,
+    delete: Option<Vec<DeletableCollectionSettingsProperty>>,
+) -> Result<(), Error> {
+    let client = client()?;
+    client
+        .update_metric_collection_settings(updater, delete)
+        .await?;
+
+    Ok(())
+}
+
+#[api(
+    input: {
+        properties: {
+            remote: {
+                schema: REMOTE_ID_SCHEMA,
+                optional: true,
+            },
+        }
+    }
+)]
+/// Trigger metric collection. If a remote is passed, only that remote is collected;
+/// otherwise, all remotes are.
+async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+    client()?
+        .trigger_metric_collection(remote.as_deref())
+        .await?;
+    Ok(())
+}
+
+#[api]
+/// Show metric collection status.
+async fn metric_collection_status() -> Result<(), Error> {
+    let result = client()?.get_metric_collection_status().await?;
+
+    let output_format = env().format_args.output_format;
+    if output_format == OutputFormat::Text {
+        for remote_status in result {
+            let timestamp = if let Some(last_collection) = remote_status.last_collection {
+                proxmox_time::strftime_local("%a, %d %b %Y %T %z", last_collection)?
+            } else {
+                "never".into()
+            };
+
+            let status = if let Some(err) = &remote_status.error {
+                err
+            } else {
+                "ok"
+            };
+
+            println!("{}: {status}", remote_status.remote);
+            println!("    last successful: {timestamp}");
+            println!();
+        }
+    } else {
+        format_and_print_result(&result, &output_format.to_string());
+    }
+    Ok(())
+}
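The text output of `metric_collection_status` above falls back to "never" for a missing timestamp and "ok" for a missing error. A std-only sketch of that rendering (the real code formats the epoch via `proxmox_time::strftime_local`; the raw epoch is kept here to stay dependency-free):

```rust
// Sketch of the text-output fallbacks used by metric_collection_status.
fn render_line(remote: &str, error: Option<&str>, last_collection: Option<i64>) -> String {
    let timestamp = match last_collection {
        Some(t) => t.to_string(), // real code: human-readable local time
        None => "never".into(),
    };
    let status = error.unwrap_or("ok");
    format!("{remote}: {status}\n    last successful: {timestamp}")
}

fn main() {
    assert_eq!(
        render_line("pve-a", None, None),
        "pve-a: ok\n    last successful: never"
    );
    let line = render_line("pbs-b", Some("connection refused"), Some(1_739_275_500));
    assert!(line.starts_with("pbs-b: connection refused"));
    println!("{}", render_line("pve-a", None, None));
}
```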
-- 
2.39.5



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task Lukas Wagner
@ 2025-02-11 12:58   ` Lukas Wagner
  2025-02-12 15:57   ` Wolfgang Bumiller
  1 sibling, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-11 12:58 UTC (permalink / raw)
  To: pdm-devel

On  2025-02-11 13:05, Lukas Wagner wrote:
> diff --git a/Cargo.toml b/Cargo.toml
> index 4f3b1d0..7dd60a5 100644
> --- a/Cargo.toml
> +++ b/Cargo.toml
> @@ -108,6 +108,7 @@ once_cell = "1.3.1"
>  openssl = "0.10.40"
>  percent-encoding = "2.1"
>  pin-project-lite = "0.2"
> +rand = "0.8"

^ Forgot to add rand to d/control. I'll add it myself if the need for a v2 arises.
If there is no need for a v2 maybe fold it in when applying.
Any way, a `make sbuild` works without it since we depend on it transitively.

-- 
- Lukas



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type Lukas Wagner
@ 2025-02-11 14:18   ` Maximiliano Sandoval
  0 siblings, 0 replies; 49+ messages in thread
From: Maximiliano Sandoval @ 2025-02-11 14:18 UTC (permalink / raw)
  To: Proxmox Datacenter Manager development discussion


Lukas Wagner <l.wagner@proxmox.com> writes:

> This commit adds the CollectionSettings type which holds settings for
> the metric collection system. Included are collection interval, max
> concurrency and upper/lower bounds for the metric collection loop.
>
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  lib/pdm-api-types/src/lib.rs               |   3 +
>  lib/pdm-api-types/src/metric_collection.rs | 180 +++++++++++++++++++++
>  2 files changed, 183 insertions(+)
>  create mode 100644 lib/pdm-api-types/src/metric_collection.rs
>
> diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
> index 3844907..6115e41 100644
> --- a/lib/pdm-api-types/src/lib.rs
> +++ b/lib/pdm-api-types/src/lib.rs
> @@ -19,6 +19,9 @@ pub use acl::*;
>  mod node_config;
>  pub use node_config::*;
>
> +mod metric_collection;
> +pub use metric_collection::*;
> +
>  mod proxy;
>  pub use proxy::HTTP_PROXY_SCHEMA;
>
> diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
> new file mode 100644
> index 0000000..99dac9b
> --- /dev/null
> +++ b/lib/pdm-api-types/src/metric_collection.rs
> @@ -0,0 +1,180 @@
> +//! API types for metric collection settings.
> +
> +use serde::{Deserialize, Serialize};
> +
> +use proxmox_schema::{api, Updater};
> +
> +/// Default metric collection interval.
> +pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
> +
> +/// Minimum metric collection interval.
> +pub const MIN_COLLECTION_INTERVAL: u64 = 10;
> +
> +/// Maximum metric collection interval.
> +/// PVE and PBS keep 30 minutes of metric history,
> +/// maximum is set to 25 minutes to leave some headroom.
> +pub const MAX_COLLECTION_INTERVAL: u64 = 1500;
> +
> +/// Default number of concurrent connections.
> +pub const DEFAULT_CONCURRENCY: usize = 10;
> +/// Maximum number of concurrent connections.
> +pub const MAX_CONCURRENCY: u64 = 50;
> +/// Minimum number of concurrent connections (no concurrency).
> +pub const MIN_CONCURRENCY: u64 = 1;
> +
> +/// Default upper bound for the random delay for the
> +/// main metric collection loop.
> +pub const DEFAULT_MAX_OFFSET: u64 = 10;
> +/// Default lower bound for the random delay for the
> +/// main metric collection loop.
> +pub const DEFAULT_MIN_OFFSET: u64 = 0;
> +
> +/// Highest configurable upper bound for the random interval offset for the main loop.
> +pub const MAX_INTERVAL_OFFSET: u64 = 300;
> +/// Lowest configureable lower bound for the random interval offset for the main loop.

There is a typo on configureable.

> +pub const MIN_INTERVAL_OFFSET: u64 = 0;
> +
> +/// Default upper bound for the random individual connection delay.
> +pub const DEFAULT_MAX_CONNECTION_DELAY: u64 = 100;
> +/// Default lower bound for the random individual connection delay.
> +pub const DEFAULT_MIN_CONNECTION_DELAY: u64 = 0;
> +
> +/// Highest configurable upper bound for the random individual connection delay.
> +pub const MAX_CONNECTION_DELAY: u64 = 1000;
> +/// Lowest configurable lower bound for the random individual connection delay.
> +pub const MIN_CONNECTION_DELAY: u64 = 0;
> +
> +#[api(
> +    properties: {
> +        "collection-interval" : {
> +            optional: true,
> +            default: DEFAULT_COLLECTION_INTERVAL as isize,
> +            minimum: MIN_COLLECTION_INTERVAL as isize,
> +            maximum: MAX_COLLECTION_INTERVAL as isize,
> +        },
> +        "max-concurrent-connections" : {
> +            optional: true,
> +            default: DEFAULT_CONCURRENCY as isize,
> +            minimum: MIN_CONCURRENCY as isize,
> +            maximum: MAX_CONCURRENCY as isize,
> +        },
> +        "max-interval-offset" : {
> +            optional: true,
> +            default: DEFAULT_MAX_OFFSET as isize,
> +            minimum: MIN_INTERVAL_OFFSET as isize,
> +            maximum: MAX_INTERVAL_OFFSET as isize,
> +        },
> +        "min-interval-offset" : {
> +            optional: true,
> +            default: DEFAULT_MIN_OFFSET as isize,
> +            minimum: MIN_INTERVAL_OFFSET as isize,
> +            maximum: MAX_INTERVAL_OFFSET as isize,
> +        },
> +        "max-connection-delay" : {
> +            optional: true,
> +            default: DEFAULT_MAX_CONNECTION_DELAY as isize,
> +            minimum: MIN_CONNECTION_DELAY as isize,
> +            maximum: MAX_CONNECTION_DELAY as isize,
> +        },
> +        "min-connection-delay" : {
> +            optional: true,
> +            default: DEFAULT_MIN_CONNECTION_DELAY as isize,
> +            minimum: MIN_CONNECTION_DELAY as isize,
> +            maximum: MAX_CONNECTION_DELAY as isize,
> +        },
> +    },
> +)]
> +#[derive(Clone, Default, Deserialize, Serialize, Updater)]
> +#[serde(rename_all = "kebab-case")]
> +/// Settings for the metric collection system.
> +pub struct CollectionSettings {
> +    /// Collection settings ID
> +    #[updater(skip)]
> +    pub id: String,
> +
> +    /// Interval in seconds at which to collect metrics.
> +    /// The point in time at which metrics are collected
> +    /// are aligned based on the collection interval. For instance,
> +    /// a collection interval of 300 (5 minutes) would schedule metric collection
> +    /// at 11:00:00, 11:05:00 (without accounting for the random offset).
> +    ///
> +    /// To avoid load spikes, metric collection runs are offeset by

There is a typo on offeset.

> +    /// a random number of seconds between min_interval_offset and max_interval_offset.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub collection_interval: Option<u64>,
> +
> +    /// Maximum number of concurrent connections while collecting metrics.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub max_concurrent_connections: Option<usize>,
> +
> +    /// Maximum offset in seconds for the metric collection interval.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub max_interval_offset: Option<u64>,
> +
> +    /// Minimum offset in seconds for the metric collection interval.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub min_interval_offset: Option<u64>,
> +
> +    /// Maximum random delay in milliseconds for each outgoing connection.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub max_connection_delay: Option<u64>,
> +
> +    /// Minimum random delay in milliseconds for each outgoing connection.
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub min_connection_delay: Option<u64>,
> +}
> +
> +#[api]
> +#[derive(Copy, Clone, Deserialize, Serialize)]
> +#[serde(rename_all = "kebab-case")]
> +/// Deletable property for [`CollectionSettings`].
> +pub enum DeletableCollectionSettingsProperty {
> +    /// Delete 'collection-interval'.
> +    CollectionInterval,
> +    /// Delete 'max-concurrent-connections'.
> +    MaxConcurrentConnections,
> +    /// Delete 'max-interval-offset'.
> +    MaxIntervalOffset,
> +    /// Delete 'min-interval-offset'.
> +    MinIntervalOffset,
> +    /// Delete 'min-connection-delay'.
> +    MaxConnectionDelay,
> +    /// Delete 'min-connection-delay'.
> +    MinConnectionDelay,
> +}
> +
> +impl CollectionSettings {
> +    /// Return the collection interval or the default if not configured.
> +    pub fn collection_interval_or_default(&self) -> u64 {
> +        self.collection_interval
> +            .unwrap_or(DEFAULT_COLLECTION_INTERVAL)
> +    }
> +
> +    /// Return the number of allowed concurrent connections or the default if not configured.
> +    pub fn max_concurrent_connections_or_default(&self) -> usize {
> +        self.max_concurrent_connections
> +            .unwrap_or(DEFAULT_CONCURRENCY)
> +    }
> +
> +    /// Return the upper bound for the main loop delay or the default if not configured.
> +    pub fn max_interval_offset_or_default(&self) -> u64 {
> +        self.max_interval_offset.unwrap_or(DEFAULT_MAX_OFFSET)
> +    }
> +
> +    /// Return the lower bound for the main loop delay or the default if not configured.
> +    pub fn min_interval_offset_or_default(&self) -> u64 {
> +        self.min_interval_offset.unwrap_or(DEFAULT_MIN_OFFSET)
> +    }
> +
> +    /// Return the upper bound for the connection delay or the default if not configured.
> +    pub fn max_connection_delay_or_default(&self) -> u64 {
> +        self.max_connection_delay
> +            .unwrap_or(DEFAULT_MAX_CONNECTION_DELAY)
> +    }
> +
> +    /// Return the lower bound for the connection delay or the default if not configured.
> +    pub fn min_connection_delay_or_default(&self) -> u64 {
> +        self.min_connection_delay
> +            .unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
> +    }
> +}
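The `*_or_default` accessors quoted above all reduce to `Option::unwrap_or` against a module-level default constant. A minimal std-only sketch of the pattern (trimmed-down hypothetical `Settings` type; the `600` default is `DEFAULT_COLLECTION_INTERVAL` from the patch):

```rust
// Default taken from the patch above.
const DEFAULT_COLLECTION_INTERVAL: u64 = 600;

struct Settings {
    collection_interval: Option<u64>,
}

impl Settings {
    /// Return the configured interval, or the default if unset.
    fn collection_interval_or_default(&self) -> u64 {
        self.collection_interval.unwrap_or(DEFAULT_COLLECTION_INTERVAL)
    }
}

fn main() {
    assert_eq!(
        Settings { collection_interval: None }.collection_interval_or_default(),
        600
    );
    assert_eq!(
        Settings { collection_interval: Some(30) }.collection_interval_or_default(),
        30
    );
    println!("ok");
}
```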



_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task Lukas Wagner
@ 2025-02-12 13:59   ` Wolfgang Bumiller
  2025-02-12 14:32     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-12 13:59 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:22PM +0100, Lukas Wagner wrote:
> This is a preparation for concurrent metric collection. While the RRD
> cache appears to be safe to concurrent access (it uses an RwLock to
> protect the data), delegating all RRD writes to a separate tokio task
> makes it IMO easier to reason about and understand. Furthermore, it
> decouples the 'metric fetching' and 'metric
> storing' parts, making it much easier to write tests for both parts
> independently.
> 
> For better code separation, this also splits out the new task into
> a new submodule.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  server/src/metric_collection/mod.rs      | 102 +++++++----------------
>  server/src/metric_collection/rrd_task.rs |  92 ++++++++++++++++++++
>  2 files changed, 124 insertions(+), 70 deletions(-)
>  create mode 100644 server/src/metric_collection/rrd_task.rs
> 
> diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
> index 5540f93..06ade5f 100644
> --- a/server/src/metric_collection/mod.rs
> +++ b/server/src/metric_collection/mod.rs
> @@ -2,18 +2,18 @@ use std::collections::HashMap;
>  use std::pin::pin;
>  
>  use anyhow::Error;
> -
> -use pbs_api_types::{MetricDataPoint, MetricDataType};
> -use proxmox_rrd::rrd::DataSourceType;
> +use tokio::sync::mpsc::{self, Sender};
>  
>  use pdm_api_types::remotes::RemoteType;
> -use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
>  
>  use crate::{connection, task_utils};
>  
>  pub mod rrd_cache;
> +mod rrd_task;
>  pub mod top_entities;
>  
> +use rrd_task::RrdStoreRequest;
> +
>  const COLLECTION_INTERVAL: u64 = 60;
>  
>  /// Initialize the RRD cache
> @@ -25,14 +25,22 @@ pub fn init() -> Result<(), Error> {
>  
>  /// Start the metric collection task.
>  pub fn start_task() {
> +    let (tx, rx) = mpsc::channel(128);
> +
> +    tokio::spawn(async move {
> +        let task_scheduler = pin!(metric_collection_task(tx));
> +        let abort_future = pin!(proxmox_daemon::shutdown_future());
> +        futures::future::select(task_scheduler, abort_future).await;
> +    });
> +
>      tokio::spawn(async move {
> -        let task_scheduler = pin!(metric_collection_task());
> +        let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
>          let abort_future = pin!(proxmox_daemon::shutdown_future());
>          futures::future::select(task_scheduler, abort_future).await;
>      });
>  }
>  
> -async fn metric_collection_task() -> Result<(), Error> {
> +async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
>      let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
>  
>      loop {
> @@ -59,37 +67,33 @@ async fn metric_collection_task() -> Result<(), Error> {
>                              .cluster_metrics_export(Some(true), Some(false), Some(start_time))
>                              .await?;
>  
> -                        //// Involves some blocking file IO
> -                        tokio::task::spawn_blocking(move || {
> -                            let mut most_recent_timestamp = 0;
> +                        let most_recent =
> +                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));

Might be more readable using `Iterator::max()`:

    metrics.data.iter().map(|e| e.timestamp).max().unwrap_or(0);

(while this changes throughout the series, there's at least 1 instance
of this pattern remaining at the end AFAICT)
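Both spellings compute the most recent timestamp; a std-only comparison of the two forms (hypothetical helper names, operating on plain `i64` slices instead of the metric structs):

```rust
// The fold used in the patch: running maximum with 0 as the start value.
fn most_recent_fold(timestamps: &[i64]) -> i64 {
    timestamps.iter().fold(0, |acc, x| acc.max(*x))
}

// The suggested alternative: Iterator::max with an explicit empty-input fallback.
fn most_recent_max(timestamps: &[i64]) -> i64 {
    timestamps.iter().copied().max().unwrap_or(0)
}

fn main() {
    let ts = [1_739_275_500, 1_739_275_560, 1_739_275_440];
    assert_eq!(most_recent_fold(&ts), most_recent_max(&ts));
    assert_eq!(most_recent_max(&ts), 1_739_275_560);
    // Empty input: both fall back to 0.
    assert_eq!(most_recent_fold(&[]), most_recent_max(&[]));
    println!("{}", most_recent_max(&ts));
}
```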

>  
> -                            for data_point in metrics.data {
> -                                most_recent_timestamp =
> -                                    most_recent_timestamp.max(data_point.timestamp);
> -                                store_metric_pve(&remote_name_clone, &data_point);
> -                            }
> +                        sender
> +                            .send(RrdStoreRequest::Pve {
> +                                remote: remote_name_clone,
> +                                metrics,
> +                            })
> +                            .await?;
>  
> -                            most_recent_timestamp
> -                        })
> -                        .await
> +                        Ok::<i64, Error>(most_recent)

^ This...

>                      }
>                      RemoteType::Pbs => {
>                          let client = connection::make_pbs_client(remote)?;
>                          let metrics = client.metrics(Some(true), Some(start_time)).await?;
>  
> -                        // Involves some blocking file IO
> -                        tokio::task::spawn_blocking(move || {
> -                            let mut most_recent_timestamp = 0;
> +                        let most_recent =
> +                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
>  
> -                            for data_point in metrics.data {
> -                                most_recent_timestamp =
> -                                    most_recent_timestamp.max(data_point.timestamp);
> -                                store_metric_pbs(&remote_name_clone, &data_point);
> -                            }
> +                        sender
> +                            .send(RrdStoreRequest::Pbs {
> +                                remote: remote_name_clone,
> +                                metrics,
> +                            })
> +                            .await?;
>  
> -                            most_recent_timestamp
> -                        })
> -                        .await
> +                        Ok::<i64, Error>(most_recent)

^ ... and this are now the only values for a single `let` binding which
now looks like:

    let x = big_always_Ok_code?; // `?` op on always-Ok()
    Ok(x) // and re-OK-wrap

Could just drop the `Ok` and `?`.

>                      }
>                  }?;
>  
> @@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
>          }
>      }
>  }
> -
> -fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
> -    let name = format!(
> -        "pve/{remote_name}/{id}/{metric}",
> -        id = data_point.id,
> -        metric = data_point.metric,
> -    );
> -
> -    let data_source_type = match data_point.ty {
> -        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
> -        ClusterMetricsDataType::Counter => DataSourceType::Counter,
> -        ClusterMetricsDataType::Derive => DataSourceType::Derive,
> -    };
> -
> -    rrd_cache::update_value(
> -        &name,
> -        data_point.value,
> -        data_point.timestamp,
> -        data_source_type,
> -    );
> -}
> -
> -fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
> -    let name = format!(
> -        "pbs/{remote_name}/{id}/{metric}",
> -        id = data_point.id,
> -        metric = data_point.metric,
> -    );
> -
> -    let data_source_type = match data_point.ty {
> -        MetricDataType::Gauge => DataSourceType::Gauge,
> -        MetricDataType::Counter => DataSourceType::Counter,
> -        MetricDataType::Derive => DataSourceType::Derive,
> -    };
> -
> -    rrd_cache::update_value(
> -        &name,
> -        data_point.value,
> -        data_point.timestamp,
> -        data_source_type,
> -    );
> -}
> diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
> new file mode 100644
> index 0000000..fe19580
> --- /dev/null
> +++ b/server/src/metric_collection/rrd_task.rs
> @@ -0,0 +1,92 @@
> +use anyhow::Error;
> +use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
> +use proxmox_rrd::rrd::DataSourceType;
> +use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
> +use tokio::sync::mpsc::Receiver;
> +
> +use super::rrd_cache;
> +
> +/// Store request for the RRD task.

> +pub(super) enum RrdStoreRequest {
> +    /// Store PVE metrics.
> +    Pve {
> +        /// Remote name
> +        remote: String,
> +        /// Metric data
> +        metrics: ClusterMetrics,
> +    },
> +    /// Store PBS metrics.
> +    Pbs {
> +        /// Remote name
> +        remote: String,
> +        /// Metric data
> +        metrics: Metrics,
> +    },
> +}
> +
> +/// Task which stores received metrics in the RRD. Metric data is fed into
> +/// this task via a MPSC channel.
> +pub(super) async fn store_in_rrd_task(
> +    mut receiver: Receiver<RrdStoreRequest>,
> +) -> Result<(), Error> {
> +    while let Some(msg) = receiver.recv().await {
> +        //// Involves some blocking file IO
> +        tokio::task::spawn_blocking(move || match msg {
> +            RrdStoreRequest::Pve { remote, metrics } => {
> +                for data_point in metrics.data {
> +                    store_metric_pve(&remote, &data_point);
> +                }
> +            }
> +            RrdStoreRequest::Pbs { remote, metrics } => {
> +                for data_point in metrics.data {
> +                    store_metric_pbs(&remote, &data_point);
> +                }
> +            }
> +        })
> +        .await?;

^ This error won't be shown anywhere as this is the `spawn()`ed task's
entry point, and the error will stop the task for the rest of the
daemon's lifetime.

In general I prefer "detached" tasks to always return `()`.

Perhaps we should have a helper for the
`spawn_task_aborted_on_shutdown()` which enforces the `()` return type
in its generic bounds as well as a version which simply logs errors if
they are returned where we also warn in the docs that the task will then
be over for the rest of the daemon's lifetime.
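For illustration, such helpers could look roughly like this — sketched with `std::thread` to keep the example self-contained; the real versions would wrap `tokio::spawn` plus the shutdown-abort logic, and all names here are made up:

```rust
use std::thread;

// Variant 1: the bounds force detached tasks to return `()`, so a
// `Result`-returning closure is rejected at compile time.
fn spawn_detached<F>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(f)
}

// Variant 2: accepts a fallible task but logs the error itself, making
// it explicit that the task is over for good once it fails.
fn spawn_logged<F, E>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() -> Result<(), E> + Send + 'static,
    E: std::fmt::Display,
{
    thread::spawn(move || {
        if let Err(err) = f() {
            eprintln!("detached task failed permanently: {err}");
        }
    })
}
```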

> +    }
> +
> +    Ok(())
> +}
> +
> +fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
> +    let name = format!(
> +        "pve/{remote_name}/{id}/{metric}",
> +        id = data_point.id,
> +        metric = data_point.metric,
> +    );
> +
> +    let data_source_type = match data_point.ty {
> +        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
> +        ClusterMetricsDataType::Counter => DataSourceType::Counter,
> +        ClusterMetricsDataType::Derive => DataSourceType::Derive,
> +    };
> +
> +    rrd_cache::update_value(
> +        &name,
> +        data_point.value,
> +        data_point.timestamp,
> +        data_source_type,
> +    );
> +}
> +
> +fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
> +    let name = format!(
> +        "pbs/{remote_name}/{id}/{metric}",
> +        id = data_point.id,
> +        metric = data_point.metric,
> +    );
> +
> +    let data_source_type = match data_point.ty {
> +        MetricDataType::Gauge => DataSourceType::Gauge,
> +        MetricDataType::Counter => DataSourceType::Counter,
> +        MetricDataType::Derive => DataSourceType::Derive,
> +    };
> +
> +    rrd_cache::update_value(
> +        &name,
> +        data_point.value,
> +        data_point.timestamp,
> +        data_source_type,
> +    );
> +}
> -- 
> 2.39.5


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task
  2025-02-12 13:59   ` Wolfgang Bumiller
@ 2025-02-12 14:32     ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-12 14:32 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On 2025-02-12 14:59, Wolfgang Bumiller wrote:
> On Tue, Feb 11, 2025 at 01:05:22PM +0100, Lukas Wagner wrote:
>> This is a preparation for concurrent metric collection. While the RRD
>> cache appears to be safe to concurrent access (it uses an RwLock to
>> protect the data), delegating all RRD writes to a separate tokio task
>> makes it IMO easier to reason about and understand. Furthermore, it
>> decouples the 'metric fetching' and 'metric
>> storing' parts, making it much easier to write tests for both parts
>> independently.
>>
>> For better code separation, this also splits out the new task into
>> a new submodule.
>>
>> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
>> ---
>>  server/src/metric_collection/mod.rs      | 102 +++++++----------------
>>  server/src/metric_collection/rrd_task.rs |  92 ++++++++++++++++++++
>>  2 files changed, 124 insertions(+), 70 deletions(-)
>>  create mode 100644 server/src/metric_collection/rrd_task.rs
>>
>> diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
>> index 5540f93..06ade5f 100644
>> --- a/server/src/metric_collection/mod.rs
>> +++ b/server/src/metric_collection/mod.rs
[ .. ]
> 
> Might be more readable using `Iterator::max()`:
> 
>     metrics.data.iter().map(|e| e.timestamp).max().unwrap_or(0);
> 
> (while this changes throughout the series, there's at least 1 instance
> of this pattern remaining at the end AFAICT)

Right, that's a tiny bit nicer to read.
In the final code there is one occurrence left in the test code; I'll change it there.
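For completeness, the two forms agree for the non-negative timestamps dealt with here (for negative values the fold would clamp at 0 while `max` would not):

```rust
// Stand-in for a metric data point; only the timestamp matters here.
struct DataPoint {
    timestamp: i64,
}

// The original fold-based pattern.
fn most_recent_fold(data: &[DataPoint]) -> i64 {
    data.iter().fold(0, |acc, x| acc.max(x.timestamp))
}

// The suggested Iterator::max() variant.
fn most_recent_max(data: &[DataPoint]) -> i64 {
    data.iter().map(|e| e.timestamp).max().unwrap_or(0)
}
```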
> 
>>  
>> -                            for data_point in metrics.data {
>> -                                most_recent_timestamp =
>> -                                    most_recent_timestamp.max(data_point.timestamp);
>> -                                store_metric_pve(&remote_name_clone, &data_point);
>> -                            }
>> +                        sender
>> +                            .send(RrdStoreRequest::Pve {
>> +                                remote: remote_name_clone,
>> +                                metrics,
>> +                            })
>> +                            .await?;
>>  
>> -                            most_recent_timestamp
>> -                        })
>> -                        .await
>> +                        Ok::<i64, Error>(most_recent)
> 
> ^ This...
> 
>>                      }
>>                      RemoteType::Pbs => {
>>                          let client = connection::make_pbs_client(remote)?;
>>                          let metrics = client.metrics(Some(true), Some(start_time)).await?;
>>  
>> -                        // Involves some blocking file IO
>> -                        tokio::task::spawn_blocking(move || {
>> -                            let mut most_recent_timestamp = 0;
>> +                        let most_recent =
>> +                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
>>  
>> -                            for data_point in metrics.data {
>> -                                most_recent_timestamp =
>> -                                    most_recent_timestamp.max(data_point.timestamp);
>> -                                store_metric_pbs(&remote_name_clone, &data_point);
>> -                            }
>> +                        sender
>> +                            .send(RrdStoreRequest::Pbs {
>> +                                remote: remote_name_clone,
>> +                                metrics,
>> +                            })
>> +                            .await?;
>>  
>> -                            most_recent_timestamp
>> -                        })
>> -                        .await
>> +                        Ok::<i64, Error>(most_recent)
> 
> ^ ... and this are now the only values for a single `let` binding which
> now looks like:
> 
>     let x = big_always_Ok_code?; // `?` op on always-Ok()
>     Ok(x) // and re-OK-wrap
> 
> Could just drop the `Ok` and `?`.
> 

This changes in a later commit where overall error handling changes - so this
is not relevant in the final code.
Anyway, thanks for pointing it out.

>>                      }
>>                  }?;
>>  
>> @@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
>>          }
>>      }
>>  }
>> -
>> -fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
>> -    let name = format!(
>> -        "pve/{remote_name}/{id}/{metric}",
>> -        id = data_point.id,
>> -        metric = data_point.metric,
>> -    );
>> -
>> -    let data_source_type = match data_point.ty {
>> -        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
>> -        ClusterMetricsDataType::Counter => DataSourceType::Counter,
>> -        ClusterMetricsDataType::Derive => DataSourceType::Derive,
>> -    };
>> -
>> -    rrd_cache::update_value(
>> -        &name,
>> -        data_point.value,
>> -        data_point.timestamp,
>> -        data_source_type,
>> -    );
>> -}
>> -
>> -fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
>> -    let name = format!(
>> -        "pbs/{remote_name}/{id}/{metric}",
>> -        id = data_point.id,
>> -        metric = data_point.metric,
>> -    );
>> -
>> -    let data_source_type = match data_point.ty {
>> -        MetricDataType::Gauge => DataSourceType::Gauge,
>> -        MetricDataType::Counter => DataSourceType::Counter,
>> -        MetricDataType::Derive => DataSourceType::Derive,
>> -    };
>> -
>> -    rrd_cache::update_value(
>> -        &name,
>> -        data_point.value,
>> -        data_point.timestamp,
>> -        data_source_type,
>> -    );
>> -}
>> diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
>> new file mode 100644
>> index 0000000..fe19580
>> --- /dev/null
>> +++ b/server/src/metric_collection/rrd_task.rs
>> @@ -0,0 +1,92 @@
>> +use anyhow::Error;
>> +use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
>> +use proxmox_rrd::rrd::DataSourceType;
>> +use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
>> +use tokio::sync::mpsc::Receiver;
>> +
>> +use super::rrd_cache;
>> +
>> +/// Store request for the RRD task.
>> +pub(super) enum RrdStoreRequest {
>> +    /// Store PVE metrics.
>> +    Pve {
>> +        /// Remote name
>> +        remote: String,
>> +        /// Metric data
>> +        metrics: ClusterMetrics,
>> +    },
>> +    /// Store PBS metrics.
>> +    Pbs {
>> +        /// Remote name
>> +        remote: String,
>> +        /// Metric data
>> +        metrics: Metrics,
>> +    },
>> +}
>> +
>> +/// Task which stores received metrics in the RRD. Metric data is fed into
>> +/// this task via a MPSC channel.
>> +pub(super) async fn store_in_rrd_task(
>> +    mut receiver: Receiver<RrdStoreRequest>,
>> +) -> Result<(), Error> {
>> +    while let Some(msg) = receiver.recv().await {
>> +        //// Involves some blocking file IO
>> +        tokio::task::spawn_blocking(move || match msg {
>> +            RrdStoreRequest::Pve { remote, metrics } => {
>> +                for data_point in metrics.data {
>> +                    store_metric_pve(&remote, &data_point);
>> +                }
>> +            }
>> +            RrdStoreRequest::Pbs { remote, metrics } => {
>> +                for data_point in metrics.data {
>> +                    store_metric_pbs(&remote, &data_point);
>> +                }
>> +            }
>> +        })
>> +        .await?;
> 
> ^ This error won't be shown anywhere as this is the `spawn()`ed task's
> entry point, and the error will stop the task for the rest of the
> daemon's lifetime.

Good point, I'll drop the `?` and log the error, if it occurs.

> 
> In general I prefer "detached" tasks to always return `()`.
> 
> Perhaps we should have a helper for the
> `spawn_task_aborted_on_shutdown()` which enforces the `()` return type
> in its generic bounds as well as a version which simply logs errors if
> they are returned where we also warn in the docs that the task will then
> be over for the rest of the daemon's lifetime.
> 
>> +    }
>> +
>> +    Ok(())
>> +}
>> +
>> +fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
>> +    let name = format!(
>> +        "pve/{remote_name}/{id}/{metric}",
>> +        id = data_point.id,
>> +        metric = data_point.metric,
>> +    );
>> +
>> +    let data_source_type = match data_point.ty {
>> +        ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
>> +        ClusterMetricsDataType::Counter => DataSourceType::Counter,
>> +        ClusterMetricsDataType::Derive => DataSourceType::Derive,
>> +    };
>> +
>> +    rrd_cache::update_value(
>> +        &name,
>> +        data_point.value,
>> +        data_point.timestamp,
>> +        data_source_type,
>> +    );
>> +}
>> +
>> +fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
>> +    let name = format!(
>> +        "pbs/{remote_name}/{id}/{metric}",
>> +        id = data_point.id,
>> +        metric = data_point.metric,
>> +    );
>> +
>> +    let data_source_type = match data_point.ty {
>> +        MetricDataType::Gauge => DataSourceType::Gauge,
>> +        MetricDataType::Counter => DataSourceType::Counter,
>> +        MetricDataType::Derive => DataSourceType::Derive,
>> +    };
>> +
>> +    rrd_cache::update_value(
>> +        &name,
>> +        data_point.value,
>> +        data_point.timestamp,
>> +        data_source_type,
>> +    );
>> +}
>> -- 
>> 2.39.5

-- 
- Lukas






* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task Lukas Wagner
  2025-02-11 12:58   ` Lukas Wagner
@ 2025-02-12 15:57   ` Wolfgang Bumiller
  2025-02-13 12:31     ` Lukas Wagner
  1 sibling, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-12 15:57 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

Looks good.
One improvement/follow-up suggestion (JoinSet), some nitpicking &
potential refactoring requests...

On Tue, Feb 11, 2025 at 01:05:23PM +0100, Lukas Wagner wrote:
> Metric collection is mostly limited by network latency, not disk IO or
> CPU. This means we should be able to get significant speedups by polling
> remotes concurrently by spawning tokio workers.
> 
> To avoid load/IO spikes, we limit the number of concurrent connections
> using a semaphore.
> 
> Each concurrent task which fetches a single remote waits a random amount
> of time before actually connecting to the remote. The upper and lower
> bounds for this random delay can be configured in the metric collection
> settings. The main aim of this mechanism is to reduce load spikes.
> 
> Furthermore, each time when the main collection interval timer
> fires, a random delay is introduced before actually starting the
> collection process.
> This is useful due to how we set up the timer; we configure
> it to fire at aligned points in time. For instance, if the
> collection interval is set to 60s, the timer fires at
> minute boundaries. Using this random offset, we can avoid
> triggering at the same time as other timers (cron, systemd).
> 
> Furthermore, we add a mechanism to trigger metric collection manually.
> This is achieved by using `tokio::select!` on both the timer's `tick`
> method and the `recv` method of an `mpsc` which is used to send control
> messages to the metric collection task.
> 
> For better code structure, the collection task is split into a separate
> module.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  Cargo.toml                                    |   1 +
>  server/Cargo.toml                             |   1 +
>  server/src/bin/proxmox-datacenter-api.rs      |   2 +-
>  .../src/metric_collection/collection_task.rs  | 317 ++++++++++++++++++
>  server/src/metric_collection/mod.rs           | 128 +++----
>  5 files changed, 366 insertions(+), 83 deletions(-)
>  create mode 100644 server/src/metric_collection/collection_task.rs
> 
> diff --git a/Cargo.toml b/Cargo.toml
> index 4f3b1d0..7dd60a5 100644
> --- a/Cargo.toml
> +++ b/Cargo.toml
> @@ -108,6 +108,7 @@ once_cell = "1.3.1"
>  openssl = "0.10.40"
>  percent-encoding = "2.1"
>  pin-project-lite = "0.2"
> +rand = "0.8"
>  regex = "1.5.5"
>  serde = { version = "1.0", features = ["derive"] }
>  serde_cbor = "0.11.1"
> diff --git a/server/Cargo.toml b/server/Cargo.toml
> index 7b0058e..c8308f2 100644
> --- a/server/Cargo.toml
> +++ b/server/Cargo.toml
> @@ -24,6 +24,7 @@ nix.workspace = true
>  once_cell.workspace = true
>  openssl.workspace = true
>  percent-encoding.workspace = true
> +rand.workspace = true
>  serde.workspace = true
>  serde_json.workspace = true
>  syslog.workspace = true
> diff --git a/server/src/bin/proxmox-datacenter-api.rs b/server/src/bin/proxmox-datacenter-api.rs
> index a79094d..6e85e52 100644
> --- a/server/src/bin/proxmox-datacenter-api.rs
> +++ b/server/src/bin/proxmox-datacenter-api.rs
> @@ -286,7 +286,7 @@ async fn run(debug: bool) -> Result<(), Error> {
>      });
>  
>      start_task_scheduler();
> -    metric_collection::start_task();
> +    metric_collection::start_task()?;
>      resource_cache::start_task();
>  
>      server.await?;
> diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
> new file mode 100644
> index 0000000..8420039
> --- /dev/null
> +++ b/server/src/metric_collection/collection_task.rs
> @@ -0,0 +1,317 @@
> +use std::{collections::HashMap, sync::Arc, time::Duration};
> +
> +use anyhow::Error;
> +use rand::Rng;
> +use tokio::{
> +    sync::{
> +        mpsc::{Receiver, Sender},
> +        OwnedSemaphorePermit, Semaphore,
> +    },
> +    time::Interval,
> +};
> +
> +use proxmox_section_config::typed::SectionConfigData;
> +
> +use pdm_api_types::{
> +    remotes::{Remote, RemoteType},
> +    CollectionSettings,
> +};
> +use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
> +
> +use crate::{connection, task_utils};
> +
> +use super::rrd_task::RrdStoreRequest;
> +
> +/// Control messages for the metric collection task.
> +pub(super) enum ControlMsg {
> +    CollectSingleRemote(String),
> +    CollectAllRemotes,
> +}
> +
> +/// Task which periodically collects metrics from all remotes and stores
> +/// them in the local metrics database.
> +pub(super) struct MetricCollectionTask {
> +    most_recent_timestamps: HashMap<String, i64>,
> +    settings: CollectionSettings,
> +    metric_data_tx: Sender<RrdStoreRequest>,
> +    control_message_rx: Receiver<ControlMsg>,
> +}
> +
> +impl MetricCollectionTask {
> +    /// Create a new metric collection task.
> +    pub(super) fn new(
> +        metric_data_tx: Sender<RrdStoreRequest>,
> +        control_message_rx: Receiver<ControlMsg>,
> +    ) -> Result<Self, Error> {
> +        let settings = Self::get_settings_or_default();
> +
> +        Ok(Self {
> +            most_recent_timestamps: HashMap::new(),
> +            settings,
> +            metric_data_tx,
> +            control_message_rx,
> +        })
> +    }
> +
> +    /// Run the metric collection task.
> +    ///
> +    /// This function never returns.
> +    #[tracing::instrument(skip_all, name = "metric_collection_task")]
> +    pub(super) async fn run(&mut self) {
> +        let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
> +
> +        log::debug!(
> +            "metric collection starting up. collection interval set to {} seconds",

^ nit: using punctuation in the middle but continuing in lower case and
not ending with punctuation 🤪

> +            self.settings.collection_interval_or_default()
> +        );
> +
> +        loop {
> +            let old_settings = self.settings.clone();
> +            tokio::select! {
> +                _ = timer.tick() => {
> +                    // Reload settings in case they have changed in the meanwhile
> +                    self.settings = Self::get_settings_or_default();
> +
> +                    log::debug!("starting metric collection from all remotes - triggered by timer");

Not sure if it's worth moving this into a `self.on_tick()` but...

> +                    self.sleep_for_random_interval_offset().await;
> +
> +                    if let Some(remotes) = Self::load_remote_config() {
> +                        let to_fetch = remotes.order.as_slice();
> +                        self.fetch_remotes(&remotes, to_fetch).await;
> +                    }
> +                }
> +
> +                val = self.control_message_rx.recv() => {
> +                    // Reload settings in case they have changed in the meanwhile
> +                    self.settings = Self::get_settings_or_default();
> +                    match val {

...but I think this part should be factored out into a
`self.handle_control_message(msg).await`

It gets indented quite deeply and it just makes sense IMO :)

> +                        Some(ControlMsg::CollectSingleRemote(remote)) => {
> +                            if let Some(remotes) = Self::load_remote_config() {

^ Also this...

> +                                log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
> +                                self.fetch_remotes(&remotes, &[remote]).await;
> +                            }
> +                        }
> +                        Some(ControlMsg::CollectAllRemotes) => {
> +                            if let Some(remotes) = Self::load_remote_config() {

... is the same as this, so if this is factored out into a separate
function, it could just early-out via a let-else there, as we don't need
to call it when `val` is `None`, and then it's quite compact.
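Roughly like this, with stub types standing in for `SectionConfigData<Remote>` and the actual fetching — everything except `ControlMsg` is made up, and the real method would of course be async and call `fetch_remotes` instead of returning the remote list:

```rust
enum ControlMsg {
    CollectSingleRemote(String),
    CollectAllRemotes,
}

// Stub standing in for the parsed remote.cfg.
struct RemoteConfig {
    order: Vec<String>,
}

fn load_remote_config() -> Option<RemoteConfig> {
    Some(RemoteConfig {
        order: vec!["pve1".into(), "pbs1".into()],
    })
}

// Early-outs via let-else keep the nesting flat; returns the list of
// remotes that would be fetched, for illustration.
fn handle_control_message(msg: Option<ControlMsg>) -> Vec<String> {
    let Some(msg) = msg else {
        return Vec::new();
    };
    let Some(config) = load_remote_config() else {
        return Vec::new();
    };
    match msg {
        ControlMsg::CollectSingleRemote(remote) => vec![remote],
        ControlMsg::CollectAllRemotes => config.order,
    }
}
```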

> +                                log::debug!("starting metric collection from all remotes - triggered by control message");
> +                                self.fetch_remotes(&remotes, &remotes.order).await;
> +                            }
> +                        }
> +                        _ => {},
> +                    }
> +                }
> +            }
> +
> +            let interval = self.settings.collection_interval_or_default();
> +
> +            if old_settings.collection_interval_or_default() != interval {
> +                log::info!(
> +                    "metric collection interval changed to {} seconds, reloading timer",
> +                    interval
> +                );
> +                timer = Self::setup_timer(interval);
> +            }
> +        }
> +    }
> +
> +    async fn sleep_for_random_interval_offset(&self) {
> +        let mut min = self.settings.min_interval_offset_or_default();
> +        let max = self.settings.max_interval_offset_or_default();
> +
> +        if min > max {
> +            log::warn!(
> +                "min-interval-offset is larger than max-interval-offset ({min} > {max}) - \
> +                capping it to max-interval-offset ({max})"
> +            );
> +            min = max;
> +        }
> +
> +        let jitter = {
> +            let mut rng = rand::thread_rng();
> +            rng.gen_range(min..=max)
> +        };
> +        tokio::time::sleep(Duration::from_secs(jitter)).await;
> +    }
> +

↑ and ↓ are crying out for a common fn with the variable name, min and
max as parameters ;-)
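The shared part is really just the clamp-and-warn; a common helper could take the setting name plus bounds, with the `rand` sampling and the `Duration` unit left at the call sites (sketch, names made up):

```rust
// Clamp a configured jitter range, warning when min exceeds max.
// Returns the (min, max) bounds to sample from.
fn clamped_jitter_bounds(setting_name: &str, mut min: u64, max: u64) -> (u64, u64) {
    if min > max {
        eprintln!(
            "min-{setting_name} is larger than max-{setting_name} \
             ({min} > {max}) - capping it to max-{setting_name} ({max})"
        );
        min = max;
    }
    (min, max)
}
```

A caller would then do e.g. `let (min, max) = clamped_jitter_bounds("interval-offset", min, max);` followed by `rng.gen_range(min..=max)`.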

> +    async fn sleep_for_random_connection_delay(settings: &CollectionSettings) {
> +        let mut min = settings.min_connection_delay_or_default();
> +        let max = settings.max_connection_delay_or_default();
> +
> +        if min > max {
> +            log::warn!(
> +                "min-collection-delay is larger than max-collection-delay ({min} > {max}) - \
> +                capping it to max-collection-delay ({max})"
> +            );
> +            min = max;
> +        }
> +
> +        let jitter = {
> +            let mut rng = rand::thread_rng();
> +            rng.gen_range(min..=max)
> +        };
> +
> +        tokio::time::sleep(Duration::from_millis(jitter)).await;
> +    }
> +
> +    fn get_settings_or_default() -> CollectionSettings {
> +        // This function is a bit odd, but not sure if there is a much nicer
> +        // way to do it. We want to fall back to defaults if
> +        //   - the config file does not exist (no errors logged)
> +        //   - if section type is wrong or the config failed to parse (log errors)

It gets a bit shorter if the `_impl` returns a result+option.
The match below can just `return settings` in the first case and have
an `Ok(None) => (),` arm; the `Err` arm only has the log line, and the
default is moved to after the match.

Alternatively a `default()` helper could also shorten it
(`Ok(default())` in the `_impl` and
`get_settings_impl().unwrap_or_else(|err| { log(err); default() })` is
also shorter :-)
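i.e. roughly this shape, with stub types in place of the real section-config parsing (the stub `_impl` always fails here, just to exercise the fallback path):

```rust
#[derive(Clone, Debug, PartialEq)]
struct CollectionSettings {
    id: String,
    collection_interval: Option<u64>,
}

// The `default()` helper mentioned above.
fn default_settings() -> CollectionSettings {
    CollectionSettings {
        id: "default".into(),
        collection_interval: None,
    }
}

// Stub: the real version would parse the section config and may fail.
fn get_settings_impl() -> Result<CollectionSettings, String> {
    Err("no such file".into())
}

// Fallible inner fn + unwrap_or_else replaces the verbose match.
fn get_settings_or_default() -> CollectionSettings {
    get_settings_impl().unwrap_or_else(|err| {
        eprintln!(
            "could not read metric collection settings: {err} - \
             falling back to default config"
        );
        default_settings()
    })
}
```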

> +
> +        fn get_settings_impl() -> Result<CollectionSettings, Error> {
> +            let (config, _) = pdm_config::metric_collection::config()?;
> +
> +            let all_sections: Vec<CollectionSettings> =
> +                config.convert_to_typed_array(COLLECTION_SETTINGS_TYPE)?;
> +
> +            for section in all_sections {
> +                if section.id == "default" {
> +                    return Ok(section);
> +                }
> +            }
> +
> +            Ok(CollectionSettings {
> +                id: "default".into(),
> +                ..Default::default()
> +            })
> +        }
> +
> +        match get_settings_impl() {
> +            Ok(settings) => settings,
> +            Err(err) => {
> +                log::error!("could not read metric collection settings: {err} - falling back to default config");
> +
> +                CollectionSettings {
> +                    id: "default".into(),
> +                    ..Default::default()
> +                }
> +            }
> +        }
> +    }
> +
> +    /// Set up a [`tokio::time::Interval`] instance with the provided interval.
> +    /// The timer will be aligned, e.g. an interval of `60` will let the timer
> +    /// fire at minute boundaries.
> +    fn setup_timer(interval: u64) -> Interval {
> +        let mut timer = tokio::time::interval(Duration::from_secs(interval));
> +        let first_run = task_utils::next_aligned_instant(interval).into();
> +        timer.reset_at(first_run);
> +
> +        timer
> +    }
> +
> +    /// Convenience helper to load `remote.cfg`, logging the error
> +    /// and returning `None` if the config could not be read.
> +    fn load_remote_config() -> Option<SectionConfigData<Remote>> {
> +        match pdm_config::remotes::config() {
> +            Ok((remotes, _)) => Some(remotes),
> +            Err(e) => {
> +                log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
> +                None
> +            }
> +        }
> +    }
> +
> +    /// Fetch metric data from a provided list of remotes concurrently.
> +    /// The maximum number of concurrent connections is determined by
> +    /// `max_concurrent_connections` in the [`CollectionSettings`]
> +    /// instance in `self`.
> +    async fn fetch_remotes(
> +        &mut self,
> +        config: &SectionConfigData<Remote>,
> +        remotes_to_fetch: &[String],
> +    ) {
> +        let semaphore = Arc::new(Semaphore::new(
> +            self.settings.max_concurrent_connections_or_default(),
> +        ));
> +        let mut handles = Vec::new();

Not sure how "robust" the spawning is wrt. ordering - and this doesn't
need to happen in this patch (unless it's easy), but a follow-up could
probably change this into a tokio `JoinSet`, that way the await-loop
below can use `.join_next()`, so that for example if an early task
happens to end up finishing last, the loop still collects the remaining
logs in a more "temporally accurate" way...
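The completion-order effect of `join_next()` can be sketched with plain threads and a channel (the real follow-up would use `tokio::task::JoinSet`; delays are exaggerated for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Spawn tasks with different runtimes; results arrive on the channel in
// completion order - like JoinSet::join_next() - not in spawn order, so
// the "slow" task spawned first is still collected last.
fn collect_in_completion_order() -> Vec<&'static str> {
    let (tx, rx) = mpsc::channel();
    for (name, delay_ms) in [("slow", 150u64), ("fast", 10u64)] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms));
            tx.send(name).unwrap();
        });
    }
    drop(tx); // close our sender so the iterator below terminates
    rx.iter().collect()
}
```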

> +
> +        for remote_name in remotes_to_fetch {
> +            let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
> +
> +            // unwrap is okay here, acquire_* will only fail if `close` has been
> +            // called on the semaphore.
> +            let permit = semaphore.clone().acquire_owned().await.unwrap();

^ Nit: where possible, I prefer `Arc::clone(&ptr)` just for the sake of
not accidentally `.clone()`ing something large after a later refactor
where the clones aren't in the patch context lines.

> +
> +            if let Some(remote) = config.get(remote_name).cloned() {
> +                log::debug!("fetching remote '{}'", remote.id);
> +                let handle = tokio::spawn(Self::fetch_single_remote(
> +                    self.settings.clone(),
> +                    remote,
> +                    start_time,
> +                    self.metric_data_tx.clone(),
> +                    permit,
> +                ));
> +
> +                handles.push((remote_name.clone(), handle));
> +            }
> +        }
> +
> +        for (remote_name, handle) in handles {
> +            let res = handle.await;
> +
> +            match res {
> +                Ok(Ok(ts)) => {
> +                    self.most_recent_timestamps
> +                        .insert(remote_name.to_string(), ts);
> +                }
> +                Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
> +                Err(err) => {
> +                    log::error!(
> +                        "join error for metric collection task for remote {remote_name}: {err}"
> +                    )
> +                }
> +            }
> +        }
> +    }
> +
> +    /// Fetch a single remote.
> +    #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
> +    async fn fetch_single_remote(
> +        settings: CollectionSettings,
> +        remote: Remote,
> +        start_time: i64,
> +        sender: Sender<RrdStoreRequest>,
> +        _permit: OwnedSemaphorePermit,
> +    ) -> Result<i64, Error> {
> +        Self::sleep_for_random_connection_delay(&settings).await;
> +
> +        let most_recent_timestamp = match remote.ty {
> +            RemoteType::Pve => {
> +                let client = connection::make_pve_client(&remote)?;
> +                let metrics = client
> +                    .cluster_metrics_export(Some(true), Some(false), Some(start_time))
> +                    .await?;
> +
> +                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
> +
> +                sender
> +                    .send(RrdStoreRequest::Pve {
> +                        remote: remote.id.clone(),
> +                        metrics,
> +                    })
> +                    .await?;
> +
> +                most_recent
> +            }
> +            RemoteType::Pbs => {
> +                let client = connection::make_pbs_client(&remote)?;
> +                let metrics = client.metrics(Some(true), Some(start_time)).await?;
> +
> +                let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
> +
> +                sender
> +                    .send(RrdStoreRequest::Pbs {
> +                        remote: remote.id.clone(),
> +                        metrics,
> +                    })
> +                    .await?;
> +
> +                most_recent
> +            }
> +        };
> +
> +        Ok(most_recent_timestamp)
> +    }
> +}
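Side note for readers following the diff: the repeated `fold` computing `most_recent` (used as `start_time` for the next fetch) behaves like this minimal sketch — the `DataPoint` type here is my simplification, the real types live in the PVE/PBS API crates:

```rust
// Hypothetical simplified datapoint type, just for illustration.
struct DataPoint {
    timestamp: i64,
}

// Same fold as in `fetch_single_remote`: newest timestamp, 0 when empty.
fn most_recent(data: &[DataPoint]) -> i64 {
    data.iter().fold(0, |acc, x| acc.max(x.timestamp))
}

fn main() {
    let data = [
        DataPoint { timestamp: 100 },
        DataPoint { timestamp: 300 },
        DataPoint { timestamp: 200 },
    ];
    assert_eq!(most_recent(&data), 300);
    // An empty response keeps the accumulator's initial value, so the
    // next collection run starts from epoch 0 again.
    assert_eq!(most_recent(&[]), 0);
    println!("ok");
}
```

Note that an empty metrics response resets nothing — the stored timestamp stays at its previous value only because the caller keeps the old map entry on error; here the fold itself just yields 0.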
> diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
> index 06ade5f..0ab6678 100644
> --- a/server/src/metric_collection/mod.rs
> +++ b/server/src/metric_collection/mod.rs
> @@ -1,20 +1,17 @@
> -use std::collections::HashMap;
>  use std::pin::pin;
> +use std::sync::OnceLock;
>  
> -use anyhow::Error;
> +use anyhow::{bail, Error};
>  use tokio::sync::mpsc::{self, Sender};
>  
> -use pdm_api_types::remotes::RemoteType;
> -
> -use crate::{connection, task_utils};
> -
> +mod collection_task;
>  pub mod rrd_cache;
>  mod rrd_task;
>  pub mod top_entities;
>  
> -use rrd_task::RrdStoreRequest;
> +use collection_task::{ControlMsg, MetricCollectionTask};
>  
> -const COLLECTION_INTERVAL: u64 = 60;
> +static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
>  
>  /// Initialize the RRD cache
>  pub fn init() -> Result<(), Error> {
> @@ -24,89 +21,56 @@ pub fn init() -> Result<(), Error> {
>  }
>  
>  /// Start the metric collection task.
> -pub fn start_task() {
> -    let (tx, rx) = mpsc::channel(128);
> +pub fn start_task() -> Result<(), Error> {
> +    let (metric_data_tx, metric_data_rx) = mpsc::channel(128);
> +
> +    let (trigger_collection_tx, trigger_collection_rx) = mpsc::channel(128);
> +    if CONTROL_MESSAGE_TX.set(trigger_collection_tx).is_err() {
> +        bail!("control message sender already set");
> +    }
>  
>      tokio::spawn(async move {
> -        let task_scheduler = pin!(metric_collection_task(tx));
> +        let metric_collection_task_future = pin!(async move {
> +            let mut task =
> +                MetricCollectionTask::new(metric_data_tx, trigger_collection_rx).unwrap();
> +            task.run().await;

^ nit: could drop the let binding (then it IMO linewraps a bit nicer,
too)

> +        });
> +
>          let abort_future = pin!(proxmox_daemon::shutdown_future());
> -        futures::future::select(task_scheduler, abort_future).await;
> +        futures::future::select(metric_collection_task_future, abort_future).await;
>      });
>  
>      tokio::spawn(async move {
> -        let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
> +        let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
>          let abort_future = pin!(proxmox_daemon::shutdown_future());
> -        futures::future::select(task_scheduler, abort_future).await;
> +        futures::future::select(rrd_task_future, abort_future).await;
>      });
> +
> +    Ok(())
>  }
>  
> -async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
> -    let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
> -
> -    loop {
> -        let delay_target = task_utils::next_aligned_instant(COLLECTION_INTERVAL);
> -        tokio::time::sleep_until(tokio::time::Instant::from_std(delay_target)).await;
> -
> -        let remotes = match pdm_config::remotes::config() {
> -            Ok((remotes, _)) => remotes,
> -            Err(e) => {
> -                log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
> -                continue;
> -            }
> -        };
> -
> -        for (remote_name, remote) in &remotes.sections {
> -            let start_time = *most_recent_timestamps.get(remote_name).unwrap_or(&0);
> -            let remote_name_clone = remote_name.clone();
> -
> -            let res = async {
> -                let most_recent_timestamp = match remote.ty {
> -                    RemoteType::Pve => {
> -                        let client = connection::make_pve_client(remote)?;
> -                        let metrics = client
> -                            .cluster_metrics_export(Some(true), Some(false), Some(start_time))
> -                            .await?;
> -
> -                        let most_recent =
> -                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
> -
> -                        sender
> -                            .send(RrdStoreRequest::Pve {
> -                                remote: remote_name_clone,
> -                                metrics,
> -                            })
> -                            .await?;
> -
> -                        Ok::<i64, Error>(most_recent)
> -                    }
> -                    RemoteType::Pbs => {
> -                        let client = connection::make_pbs_client(remote)?;
> -                        let metrics = client.metrics(Some(true), Some(start_time)).await?;
> -
> -                        let most_recent =
> -                            metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
> -
> -                        sender
> -                            .send(RrdStoreRequest::Pbs {
> -                                remote: remote_name_clone,
> -                                metrics,
> -                            })
> -                            .await?;
> -
> -                        Ok::<i64, Error>(most_recent)
> -                    }
> -                }?;
> -
> -                Ok::<i64, Error>(most_recent_timestamp)
> -            }
> -            .await;
> -
> -            match res {
> -                Ok(ts) => {
> -                    most_recent_timestamps.insert(remote_name.to_string(), ts);
> -                }
> -                Err(err) => log::error!("failed to collect metrics for {remote_name}: {err}"),
> -            }
> -        }
> +/// Schedule metric collection for a given remote as soon as possible.
> +///
> +/// Has no effect if the tx end of the channel has not been initialized yet.
> +/// Returns an error if the mpsc channel has been closed already.
> +pub async fn trigger_metric_collection_for_remote(remote: &str) -> Result<(), Error> {
> +    if let Some(sender) = CONTROL_MESSAGE_TX.get() {

^ nit: This is the normal case (given that `CONTROL_MESSAGE_TX` gets
initialized in `start_task()` on startup), so IMO the function signature
could already take `String` to allow the caller to move it (or `impl
Into<String>` to make this implicit) - both callers could currently
move it ;-)

> +        sender
> +            .send(ControlMsg::CollectSingleRemote(remote.into()))
> +            .await?;
>      }
> +
> +    Ok(())
> +}
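For illustration, the `OnceLock`-guarded sender pattern used by both trigger functions can be reduced to a std-only sketch (a sync `std::sync::mpsc` channel stands in for tokio's; `trigger` returning `bool` is my simplification of the "no effect if uninitialized" behavior):

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::OnceLock;

// Global sender, set once at startup; until then, triggering is a no-op.
static TX: OnceLock<Sender<String>> = OnceLock::new();

fn trigger(remote: &str) -> bool {
    match TX.get() {
        Some(tx) => tx.send(remote.to_string()).is_ok(),
        None => false, // not initialized yet: silently do nothing
    }
}

fn main() {
    // Before initialization, triggering has no effect.
    assert!(!trigger("pve-1"));

    let (tx, rx) = channel();
    assert!(TX.set(tx).is_ok());

    // After initialization, the control message reaches the receiver.
    assert!(trigger("pve-1"));
    assert_eq!(rx.recv().unwrap(), "pve-1");
    println!("ok");
}
```

The `OnceLock` makes the "set exactly once in `start_task()`, read from anywhere" contract explicit, which is why a second `start_task()` call bails above.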
> +
> +/// Schedule metric collection for all remotes as soon as possible.
> +///
> +/// Has no effect if the tx end of the channel has not been initialized yet.
> +/// Returns an error if the mpsc channel has been closed already.
> +pub async fn trigger_metric_collection() -> Result<(), Error> {
> +    if let Some(sender) = CONTROL_MESSAGE_TX.get() {
> +        sender.send(ControlMsg::CollectAllRemotes).await?;
> +    }
> +
> +    Ok(())
>  }
> -- 
> 2.39.5


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
@ 2025-02-13  8:55   ` Wolfgang Bumiller
  2025-02-13 13:50     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13  8:55 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:26PM +0100, Lukas Wagner wrote:
> Due to the fact that the timer fires at aligned points in time and might
> not fire right away after being set up, it could happen that we get gaps
> in the data if we change the timer interval or at daemon startup.
> 
> To mitigate this, on daemon startup and also if the collection interval
> changes, we
>   - check if the time until the next scheduled regular collection
>     plus the time *since* the last successful collection exceeds
>     the configured collection interval
>   - if yes, we collect immediately
>   - if no, we do nothing and let the remote be collected at the
>     next timer tick
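The check described above boils down to a bit of integer arithmetic; a minimal standalone sketch (helper name and signature are mine, not from the patch):

```rust
// Hypothetical helper mirroring the overdue check: collect now if waiting
// for the next timer tick would make the gap since the last successful
// collection exceed the configured interval.
fn is_overdue(now: i64, last_collection: i64, until_next_run_secs: u64, interval_secs: u64) -> bool {
    (now - last_collection) + until_next_run_secs as i64 > interval_secs as i64
}

fn main() {
    // Last collection 50s ago, next tick in 20s, interval 60s:
    // waiting would stretch the gap to 70s > 60s, so collect immediately.
    assert!(is_overdue(1_000, 950, 20, 60));
    // Last collection 30s ago, next tick in 20s: 50s <= 60s, just wait.
    assert!(!is_overdue(1_000, 970, 20, 60));
    println!("ok");
}
```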
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  .../src/metric_collection/collection_task.rs  | 72 +++++++++++++++++--
>  1 file changed, 65 insertions(+), 7 deletions(-)
> 
> diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
> index b4e3207..f0742ea 100644
> --- a/server/src/metric_collection/collection_task.rs
> +++ b/server/src/metric_collection/collection_task.rs
> @@ -1,4 +1,7 @@
> -use std::{sync::Arc, time::Duration};
> +use std::{
> +    sync::Arc,
> +    time::{Duration, Instant},
> +};
>  
>  use anyhow::Error;
>  use rand::Rng;
> @@ -69,13 +72,18 @@ impl MetricCollectionTask {
>      /// This function never returns.
>      #[tracing::instrument(skip_all, name = "metric_collection_task")]
>      pub(super) async fn run(&mut self) {
> -        let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
> +        let (mut timer, mut next_run) =
> +            Self::setup_timer(self.settings.collection_interval_or_default());
>  
>          log::debug!(
>              "metric collection starting up. collection interval set to {} seconds",
>              self.settings.collection_interval_or_default()
>          );
>  
> +        // Check and fetch any remote which would be overdue by the time the
> +        // timer first fires.
> +        self.fetch_overdue_and_save_state(next_run).await;
> +
>          loop {
>              let old_settings = self.settings.clone();
>              tokio::select! {
> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
>                      "metric collection interval changed to {} seconds, reloading timer",
>                      interval
>                  );
> -                timer = Self::setup_timer(interval);
> +                (timer, next_run) = Self::setup_timer(interval);
> +                // If we change (and therefore reset) our timer right before it fires,
> +                // we could potentially miss one collection event.

Couldn't we instead just pass `next_run` through to `setup_timer` and
call `reset_at(next_run)` on it? (`first_run` would only be used in the
initial setup, so `next_run` could either be an `Option`, or the setup
code does the `next_aligned_instant` call...)

This should be much less code by making the new
`fetch_overdue{,_and_save_state}()` functions unnecessary, or am I
missing something?

> +                // Therefore fetch all remotes which would be due for metric collection before
> +                // the new timer fires.
> +                self.fetch_overdue_and_save_state(next_run).await;
>              }
>          }
>      }
> @@ -208,12 +221,12 @@ impl MetricCollectionTask {
>      /// Set up a [`tokio::time::Interval`] instance with the provided interval.
>      /// The timer will be aligned, e.g. an interval of `60` will let the timer
>      /// fire at minute boundaries.
> -    fn setup_timer(interval: u64) -> Interval {
> +    fn setup_timer(interval: u64) -> (Interval, Instant) {
>          let mut timer = tokio::time::interval(Duration::from_secs(interval));
> -        let first_run = task_utils::next_aligned_instant(interval).into();
> -        timer.reset_at(first_run);
> +        let first_run = task_utils::next_aligned_instant(interval);
> +        timer.reset_at(first_run.into());
>  
> -        timer
> +        (timer, first_run)
>      }
>  
>      /// Convenience helper to load `remote.cfg`, logging the error
> @@ -292,6 +305,51 @@ impl MetricCollectionTask {
>          }
>      }
>  
> +    /// Fetch metric data from remotes which are overdue for collection and save
> +    /// collection state.
> +    async fn fetch_overdue_and_save_state(&mut self, next_run: Instant) {
> +        if let Some(remotes) = Self::load_remote_config() {
> +            self.fetch_overdue(&remotes, next_run).await;
> +            if let Err(e) = self.state.save() {
> +                log::error!("could not update metric collection state: {e}");
> +            }
> +        }
> +    }
> +
> +    /// Fetch metric data from remotes which are overdue for collection.
> +    ///
> +    /// Use this on startup of the metric collection loop as well as
> +    /// when the collection interval changes.
> +    ///
> +    /// Does nothing if the remote config could not be read; in that case an
> +    /// error is logged.
> +    async fn fetch_overdue(&mut self, remotes: &SectionConfigData<Remote>, next_run: Instant) {
> +        let left_until_scheduled = next_run - Instant::now();
> +        let now = proxmox_time::epoch_i64();
> +
> +        let mut overdue = Vec::new();
> +
> +        for remote in &remotes.order {
> +            let last_collection = self
> +                .state
> +                .get_status(remote)
> +                .and_then(|s| s.last_collection)
> +                .unwrap_or(0);
> +
> +            let diff = now - last_collection;
> +
> +            if diff + left_until_scheduled.as_secs() as i64
> +                > self.settings.collection_interval_or_default() as i64
> +            {
> +                log::debug!(
> +                    "starting metric collection for remote '{remote}' - triggered because collection is overdue"
> +                );
> +                overdue.push(remote.clone());
> +            }
> +        }
> +        self.fetch_remotes(remotes, &overdue).await;
> +    }
> +
>      /// Fetch a single remote.
>      #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
>      async fn fetch_single_remote(
> -- 
> 2.39.5
> 
> 
> 

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD Lukas Wagner
@ 2025-02-13 11:53   ` Wolfgang Bumiller
  2025-02-13 12:12     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 11:53 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:33PM +0100, Lukas Wagner wrote:
> For large setups, it might be useful to know how much time was needed to
> collect metrics for *all* remotes together, e.g. for making sure that
> the collection interval is not exceeded.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  .../src/metric_collection/collection_task.rs  | 15 ++++++
>  server/src/metric_collection/rrd_task.rs      | 50 +++++++++++++++----
>  2 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
> index 27a9987..501df9b 100644
> --- a/server/src/metric_collection/collection_task.rs
> +++ b/server/src/metric_collection/collection_task.rs
> @@ -22,6 +22,7 @@ use pdm_api_types::{
>  };
>  use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
>  
> +use crate::metric_collection::rrd_task::CollectionStats;
>  use crate::{connection, task_utils};
>  
>  use super::{
> @@ -97,6 +98,20 @@ impl MetricCollectionTask {
>                      if let Some(remotes) = Self::load_remote_config() {
>                          let to_fetch = remotes.order.as_slice();
>                          self.fetch_remotes(&remotes, to_fetch).await;
> +
> +                        let now = Instant::now();
> +                        self.fetch_remotes(&remotes, &remotes.order).await;

2 `fetch_remotes` calls now look a bit weird ;-)
This should probably be folded into some later change?

> +                        let elapsed = now.elapsed();
> +
> +                        if let Err(err) = self.metric_data_tx.send(
> +                            RrdStoreRequest::CollectionStats {
> +                                timestamp: proxmox_time::epoch_i64(),
> +                                stats: CollectionStats {
> +                                    total_time: elapsed.as_secs_f64() * 1000.
> +                                }
> +                            }).await {
> +                            log::error!("could not send collection stats to rrd task: {err}");
> +                        }
>                      }
>                  }
>  



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings Lukas Wagner
@ 2025-02-13 12:09   ` Wolfgang Bumiller
  2025-02-13 12:15     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 12:09 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:35PM +0100, Lukas Wagner wrote:
> This one lives under /config/metric-collection.
> At the moment, we do not offer full CRUD, but only offer a hard-coded
> collection settings instance at /config/metric-collection/default, which
> can be retrieved via GET and updated via PUT.
> 
> Later, we might update this to full CRUD, e.g. to have different
> settings for different 'poll-groups' consisting of a sub-set of remotes.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  server/src/api/config/metric_collection.rs | 166 +++++++++++++++++++++
>  server/src/api/config/mod.rs               |   2 +
>  2 files changed, 168 insertions(+)
>  create mode 100644 server/src/api/config/metric_collection.rs
> 
> diff --git a/server/src/api/config/metric_collection.rs b/server/src/api/config/metric_collection.rs
> new file mode 100644
> index 0000000..1146101
> --- /dev/null
> +++ b/server/src/api/config/metric_collection.rs
> @@ -0,0 +1,166 @@
> +use anyhow::{bail, Error};
> +
> +use proxmox_config_digest::ConfigDigest;
> +use proxmox_router::{list_subdirs_api_method, Router, RpcEnvironment, SubdirMap};
> +use proxmox_schema::api;
> +use proxmox_sortable_macro::sortable;
> +
> +use pdm_api_types::{
> +    CollectionSettings, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
> +};
> +use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
> +
> +#[sortable]
> +const SUBDIRS: SubdirMap = &sorted!([("default", &DEFAULT_ROUTER),]);
> +
> +pub const ROUTER: Router = Router::new()
> +    .get(&list_subdirs_api_method!(SUBDIRS))
> +    .subdirs(SUBDIRS);
> +
> +const DEFAULT_ROUTER: Router = Router::new()
> +    .get(&API_METHOD_GET_SETTINGS)
> +    .put(&API_METHOD_UPDATE_SETTINGS);
> +
> +#[api(
> +    protected: false,
> +)]
> +/// Get metric collection settings.
> +pub fn get_settings(rpcenv: &mut dyn RpcEnvironment) -> Result<CollectionSettings, Error> {
> +    let (config, digest) = pdm_config::metric_collection::config()?;
> +    rpcenv["digest"] = digest.to_hex().into();
> +
> +    if config.sections.contains_key("default") {
> +        let a = config.lookup(COLLECTION_SETTINGS_TYPE, "default")?;
> +        Ok(a)

tiny nit: Clippy would warn here, and 'a' is a weird name anyway ;-)
(drop the let binding)

> +    } else {
> +        Ok(CollectionSettings {
> +            id: "default".into(),
> +            ..Default::default()
> +        })
> +    }
> +}



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods Lukas Wagner
@ 2025-02-13 12:10   ` Wolfgang Bumiller
  2025-02-13 13:52     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 12:10 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:40PM +0100, Lukas Wagner wrote:
> This adds bindings for collection settings, trigger, and status APIs.
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  lib/pdm-client/src/lib.rs | 87 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 87 insertions(+)
> 
> diff --git a/lib/pdm-client/src/lib.rs b/lib/pdm-client/src/lib.rs
> index a41b82c..ae41e9a 100644
> --- a/lib/pdm-client/src/lib.rs
> +++ b/lib/pdm-client/src/lib.rs
> @@ -320,6 +320,93 @@ impl<T: HttpApiClient> PdmClient<T> {
>              .ok_or_else(|| Error::BadApi("api returned no webauthn entry id".to_string(), None))
>      }
>  
> +    /// Get global metric collection settings.
> +    pub async fn get_metric_collection_settings(
> +        &self,
> +    ) -> Result<pdm_api_types::CollectionSettings, Error> {
> +        let path = "/api2/extjs/config/metric-collection/default";
> +        Ok(self.0.get(path).await?.expect_json()?.data)
> +    }
> +
> +    /// Update global metric collection settings.
> +    pub async fn update_metric_collection_settings(
> +        &self,
> +        updater: pdm_api_types::CollectionSettingsUpdater,
> +        delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
> +    ) -> Result<(), Error> {
> +        let path = "/api2/extjs/config/metric-collection/default";
> +
> +        #[derive(Serialize)]
> +        struct UpdateSettings<'a> {
> +            updater: &'a pdm_api_types::CollectionSettingsUpdater,
> +            #[serde(skip_serializing_if = "Option::is_none")]
> +            delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
> +        }
> +
> +        let params = UpdateSettings {
> +            updater: &updater,
> +            delete,
> +        };
> +        self.0.put(path, &params).await?.nodata()?;
> +        Ok(())
> +    }
> +
> +    /// Trigger metric collection for a single remote, or for all remotes if none is provided.
> +    pub async fn trigger_metric_collection(
> +        &self,
> +        remote: Option<&str>,
> +    ) -> Result<(), proxmox_client::Error> {
> +        let path = "/api2/extjs/metric-collection/trigger";
> +
> +        #[derive(Serialize)]
> +        struct TriggerParams<'a> {
> +            #[serde(skip_serializing_if = "Option::is_none")]
> +            remote: Option<&'a str>,
> +        }
> +
> +        self.0
> +            .post(path, &TriggerParams { remote })
> +            .await?
> +            .expect_json::<()>()?;

Looks like this should be `.nodata()`?

> +
> +        Ok(())
> +    }
> +
> +    /// Get global metric collection status.
> +    pub async fn get_metric_collection_status(
> +        &self,
> +    ) -> Result<Vec<pdm_api_types::MetricCollectionStatus>, Error> {
> +        let path = "/api2/extjs/metric-collection/status";
> +        Ok(self.0.get(path).await?.expect_json()?.data)
> +    }
> +
> +    /// Get global metric collection RRD data.
> +    pub async fn get_metric_collection_rrddata(
> +        &self,
> +        mode: RrdMode,
> +        timeframe: RrdTimeframe,
> +    ) -> Result<pdm_api_types::rrddata::FullCollectionDatapoint, Error> {
> +        let mut path = "/api2/extjs/metric-collection/rrddata".to_string();
> +        let mut sep = '?';
> +        add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
> +        add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
> +        Ok(self.0.get(&path).await?.expect_json()?.data)
> +    }
> +
> +    /// Get per-remote metric collection RRD data.
> +    pub async fn get_per_remote_metric_collection_rrddata(
> +        &self,
> +        remote: &str,
> +        mode: RrdMode,
> +        timeframe: RrdTimeframe,
> +    ) -> Result<pdm_api_types::rrddata::RemoteCollectionDatapoint, Error> {
> +        let mut path = format!("/api2/extjs/remotes/{remote}/metric-collection-rrddata");
> +        let mut sep = '?';
> +        add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
> +        add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
> +        Ok(self.0.get(&path).await?.expect_json()?.data)
> +    }
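For context, a helper like `add_query_arg` presumably appends `key=value` pairs while toggling the separator so the first argument gets `?` and later ones `&`. A hypothetical sketch — the real helper in pdm-client may well differ (e.g. by percent-encoding values):

```rust
use std::fmt::Display;

// Hypothetical sketch of a query-argument helper: append "<sep><key>=<value>"
// and flip the separator from '?' to '&'; skip `None` values entirely.
fn add_query_arg<T: Display>(path: &mut String, sep: &mut char, key: &str, value: &Option<T>) {
    if let Some(v) = value {
        path.push(*sep);
        *sep = '&';
        path.push_str(key);
        path.push('=');
        path.push_str(&v.to_string());
    }
}

fn main() {
    let mut path = "/api2/extjs/metric-collection/rrddata".to_string();
    let mut sep = '?';
    add_query_arg(&mut path, &mut sep, "cf", &Some("average"));
    add_query_arg(&mut path, &mut sep, "timeframe", &Some("hour"));
    add_query_arg::<&str>(&mut path, &mut sep, "skipped", &None);
    assert_eq!(
        path,
        "/api2/extjs/metric-collection/rrddata?cf=average&timeframe=hour"
    );
    println!("ok");
}
```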
> +
>      pub async fn pve_list_nodes(
>          &self,
>          remote: &str,
> -- 
> 2.39.5



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD
  2025-02-13 11:53   ` Wolfgang Bumiller
@ 2025-02-13 12:12     ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 12:12 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 12:53, Wolfgang Bumiller wrote:
> On Tue, Feb 11, 2025 at 01:05:33PM +0100, Lukas Wagner wrote:
>> For large setups, it might be useful to know how much time was needed to
>> collect metrics for *all* remotes together, e.g. for making sure that
>> the collection interval is not exceeded.
>>
>> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
>> ---
>>  .../src/metric_collection/collection_task.rs  | 15 ++++++
>>  server/src/metric_collection/rrd_task.rs      | 50 +++++++++++++++----
>>  2 files changed, 56 insertions(+), 9 deletions(-)
>>
>> diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
>> index 27a9987..501df9b 100644
>> --- a/server/src/metric_collection/collection_task.rs
>> +++ b/server/src/metric_collection/collection_task.rs
>> @@ -22,6 +22,7 @@ use pdm_api_types::{
>>  };
>>  use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
>>  
>> +use crate::metric_collection::rrd_task::CollectionStats;
>>  use crate::{connection, task_utils};
>>  
>>  use super::{
>> @@ -97,6 +98,20 @@ impl MetricCollectionTask {
>>                      if let Some(remotes) = Self::load_remote_config() {
>>                          let to_fetch = remotes.order.as_slice();
>>                          self.fetch_remotes(&remotes, to_fetch).await;
>> +
>> +                        let now = Instant::now();
>> +                        self.fetch_remotes(&remotes, &remotes.order).await;
> 
> 2 `fetch_remotes` calls now look a bit weird ;-)
> should probably fold in some later change?
> 

Uhhh, yeah, I guess that's a mishap from a conflict during rebasing. I'll fix it, thanks.

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
@ 2025-02-13 12:14   ` Wolfgang Bumiller
  2025-02-13 14:17     ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 12:14 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Tue, Feb 11, 2025 at 01:05:41PM +0100, Lukas Wagner wrote:
> This adds new commands to the proxmox-datacenter-client CLI tool, namely
>  - to update metric collection settings
>  - trigger metric collection
>  - show status of the last metric collection
> 
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
>  cli/client/Cargo.toml               |   1 +
>  cli/client/src/main.rs              |   2 +
>  cli/client/src/metric_collection.rs | 164 ++++++++++++++++++++++++++++
>  3 files changed, 167 insertions(+)
>  create mode 100644 cli/client/src/metric_collection.rs
> 
> diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
> index 96016cd..5c96b28 100644
> --- a/cli/client/Cargo.toml
> +++ b/cli/client/Cargo.toml
> @@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
>  proxmox-schema = { workspace = true, features = [ "api-macro" ] }
>  proxmox-section-config.workspace = true
>  proxmox-sys.workspace = true
> +proxmox-time.workspace = true
>  proxmox-tfa = { workspace = true, features = [ "types" ] }
>  
>  # for the pve API types:
> diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
> index c913e97..55877cf 100644
> --- a/cli/client/src/main.rs
> +++ b/cli/client/src/main.rs
> @@ -17,6 +17,7 @@ pub mod env;
>  
>  pub mod acl;
>  pub mod config;
> +pub mod metric_collection;
>  pub mod pbs;
>  pub mod pve;
>  pub mod remotes;
> @@ -93,6 +94,7 @@ fn main_do() -> Result<(), Error> {
>          )
>          .insert("acl", acl::cli())
>          .insert("login", CliCommand::new(&API_METHOD_LOGIN))
> +        .insert("metric-collection", metric_collection::cli())

maaaaaybe...
    .alias(&["mc"], &["metric-collection"])
? 🤪

>          .insert("pbs", pbs::cli())
>          .insert("pve", pve::cli())
>          .insert("remote", remotes::cli())
> diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
> new file mode 100644
> index 0000000..8ca861a
> --- /dev/null
> +++ b/cli/client/src/metric_collection.rs
> @@ -0,0 +1,164 @@
> +use anyhow::Error;
> +use pdm_api_types::{
> +    remotes::REMOTE_ID_SCHEMA, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
> +};
> +use proxmox_router::cli::{
> +    format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
> +};
> +use proxmox_schema::api;
> +
> +use crate::{client, env};
> +
> +pub fn cli() -> CommandLineInterface {
> +    CliCommandMap::new()
> +        .insert(
> +            "show-settings",
> +            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
> +        )

↑↓ wouldn't `settings show`, `settings update` be more consistent?

> +        .insert(
> +            "update-settings",
> +            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
> +        )
> +        .insert(
> +            "trigger",
> +            CliCommand::new(&API_METHOD_TRIGGER_METRIC_COLLECTION),
> +        )
> +        .insert(
> +            "status",
> +            CliCommand::new(&API_METHOD_METRIC_COLLECTION_STATUS),
> +        )
> +        .into()
> +}
> +
> +#[api]
> +/// Show metric collection settings.
> +async fn show_collection_settings() -> Result<(), Error> {
> +    let client = client()?;
> +    let settings = client.get_metric_collection_settings().await?;
> +
> +    fn default_if_none<T>(value: Option<T>) -> &'static str {
> +        if value.is_none() {
> +            " (default)"
> +        } else {
> +            ""
> +        }
> +    }
> +
> +    let output_format = env().format_args.output_format;
> +    if output_format == OutputFormat::Text {
> +        println!("Metric collection settings:");
> +        println!(
> +            "    collection-interval: {}s {}",
> +            settings.collection_interval_or_default(),
> +            default_if_none(settings.collection_interval),
> +        );
> +        println!(
> +            "    max-interval-offset: {}s {}",
> +            settings.max_interval_offset_or_default(),
> +            default_if_none(settings.max_interval_offset)
> +        );
> +        println!(
> +            "    min-interval-offset: {}s {}",
> +            settings.min_interval_offset_or_default(),
> +            default_if_none(settings.min_interval_offset)
> +        );
> +        println!(
> +            "    max-connection-delay: {}ms {}",
> +            settings.max_connection_delay_or_default(),
> +            default_if_none(settings.max_connection_delay)
> +        );
> +        println!(
> +            "    min-connection-delay: {}ms {}",
> +            settings.min_connection_delay_or_default(),
> +            default_if_none(settings.min_connection_delay)
> +        );
> +        println!(
> +            "    max-concurrent-connections: {} {}",
> +            settings.max_concurrent_connections_or_default(),
> +            default_if_none(settings.max_concurrent_connections),
> +        );
> +    } else {
> +        format_and_print_result(&settings, &output_format.to_string());
> +    }
> +    Ok(())
> +}
> +
> +#[api(
> +    input: {
> +        properties: {
> +            updater: {
> +                type: CollectionSettingsUpdater,
> +                flatten: true,
> +            },
> +            delete: {
> +                description: "Clear/reset collection settings properties.",
> +                optional: true,
> +                type: Array,
> +                items: {
> +                    type: DeletableCollectionSettingsProperty,
> +                },
> +            },
> +        }
> +    }
> +)]
> +/// Change metric collection settings.
> +async fn update_collection_settings(
> +    updater: CollectionSettingsUpdater,
> +    delete: Option<Vec<DeletableCollectionSettingsProperty>>,
> +) -> Result<(), Error> {
> +    let client = client()?;
> +    client
> +        .update_metric_collection_settings(updater, delete)
> +        .await?;
> +
> +    Ok(())
> +}
> +
> +#[api(
> +    input: {
> +        properties: {
> +            remote: {
> +                schema: REMOTE_ID_SCHEMA,
> +                optional: true,
> +            },
> +        }
> +    }
> +)]
> +/// Trigger metric collection. If a remote is passed, only this remote will be collected, otherwise
> +/// all.
> +async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
> +    client()?
> +        .trigger_metric_collection(remote.as_deref())
> +        .await?;
> +    Ok(())
> +}
> +
> +#[api]
> +/// Show metric collection status.
> +async fn metric_collection_status() -> Result<(), Error> {
> +    let result = client()?.get_metric_collection_status().await?;
> +
> +    let output_format = env().format_args.output_format;
> +    if output_format == OutputFormat::Text {
> +        for remote_status in result {
> +            let timestamp = if let Some(last_collection) = remote_status.last_collection {
> +                proxmox_time::strftime_local("%a, %d %b %Y %T %z", last_collection)?
> +            } else {
> +                "never".into()
> +            };
> +
> +            let status = if let Some(err) = &remote_status.error {
> +                err
> +            } else {
> +                "ok"
> +            };
> +
> +            println!("{}: {status}", remote_status.remote);
> +            println!("    last successful: {timestamp}");
> +            println!();
> +        }
> +    } else {
> +        format_and_print_result(&result, &output_format.to_string());
> +    }
> +    Ok(())
> +}
> -- 
> 2.39.5


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel


* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings
  2025-02-13 12:09   ` Wolfgang Bumiller
@ 2025-02-13 12:15     ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 12:15 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 13:09, Wolfgang Bumiller wrote:
> On Tue, Feb 11, 2025 at 01:05:35PM +0100, Lukas Wagner wrote:
>> This one lives under /config/metric-collection.
>> At the moment, we do not offer full CRUD, but only offer a hard-coded
>> collection settings instance at /config/metric-collection/default, which
>> can be retrieved via GET and updated via PUT.
>>
>> Later, we might update this to full CRUD, e.g. to have different
>> settings for different 'poll-groups' consisting of a sub-set of remotes.
>>
>> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
>> ---
>>  server/src/api/config/metric_collection.rs | 166 +++++++++++++++++++++
>>  server/src/api/config/mod.rs               |   2 +
>>  2 files changed, 168 insertions(+)
>>  create mode 100644 server/src/api/config/metric_collection.rs
>>
>> diff --git a/server/src/api/config/metric_collection.rs b/server/src/api/config/metric_collection.rs
>> new file mode 100644
>> index 0000000..1146101
>> --- /dev/null
>> +++ b/server/src/api/config/metric_collection.rs
>> @@ -0,0 +1,166 @@
>> +use anyhow::{bail, Error};
>> +
>> +use proxmox_config_digest::ConfigDigest;
>> +use proxmox_router::{list_subdirs_api_method, Router, RpcEnvironment, SubdirMap};
>> +use proxmox_schema::api;
>> +use proxmox_sortable_macro::sortable;
>> +
>> +use pdm_api_types::{
>> +    CollectionSettings, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
>> +};
>> +use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
>> +
>> +#[sortable]
>> +const SUBDIRS: SubdirMap = &sorted!([("default", &DEFAULT_ROUTER),]);
>> +
>> +pub const ROUTER: Router = Router::new()
>> +    .get(&list_subdirs_api_method!(SUBDIRS))
>> +    .subdirs(SUBDIRS);
>> +
>> +const DEFAULT_ROUTER: Router = Router::new()
>> +    .get(&API_METHOD_GET_SETTINGS)
>> +    .put(&API_METHOD_UPDATE_SETTINGS);
>> +
>> +#[api(
>> +    protected: false,
>> +)]
>> +/// Get metric collection settings.
>> +pub fn get_settings(rpcenv: &mut dyn RpcEnvironment) -> Result<CollectionSettings, Error> {
>> +    let (config, digest) = pdm_config::metric_collection::config()?;
>> +    rpcenv["digest"] = digest.to_hex().into();
>> +
>> +    if config.sections.contains_key("default") {
>> +        let a = config.lookup(COLLECTION_SETTINGS_TYPE, "default")?;
>> +        Ok(a)
> 
> tiny nit: Clippy would warn here, and 'a' is a weird name anyway ;-)
> (drop the let binding)
>

Already fixed locally for v2, thanks.

Yeah, sometimes 'a' is my 'ugh, not sure what to name this yet' or
'ugh, what type is this again, let me see the inline type hints from my LSP' name,
and sometimes I forget to fix that afterwards ;)


-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task
  2025-02-12 15:57   ` Wolfgang Bumiller
@ 2025-02-13 12:31     ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 12:31 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel

On  2025-02-12 16:57, Wolfgang Bumiller wrote:
> Looks good.
> One improvement/follow-up suggestion (JoinSet), some nitpicking &
> potential refactoring requests...
> 

Thanks for the review :)

I'll incorporate your suggestions in a v2; I already have some other minor changes
queued up for it (typos, punctuation, minor code-style improvements that Maximiliano
pointed out to me informally across the desk ;) )


>> +
>> +    /// Run the metric collection task.
>> +    ///
>> +    /// This function never returns.
>> +    #[tracing::instrument(skip_all, name = "metric_collection_task")]
>> +    pub(super) async fn run(&mut self) {
>> +        let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
>> +
>> +        log::debug!(
>> +            "metric collection starting up. collection interval set to {} seconds",
> 
> ^ nit: using punctuation in the middle but continuing in lower case and
> not ending with punctuation 🤪

Good point, fixed for v2.

> 
>> +            self.settings.collection_interval_or_default()
>> +        );
>> +
>> +        loop {
>> +            let old_settings = self.settings.clone();
>> +            tokio::select! {
>> +                _ = timer.tick() => {
>> +                    // Reload settings in case they have changed in the meanwhile
>> +                    self.settings = Self::get_settings_or_default();
>> +
>> +                    log::debug!("starting metric collection from all remotes - triggered by timer");
> 
> Not sure if it's worth moving this into a `self.on_tick()` but...
> 

yeah, I think it makes sense, thanks.

>> +                    self.sleep_for_random_interval_offset().await;
>> +
>> +                    if let Some(remotes) = Self::load_remote_config() {
>> +                        let to_fetch = remotes.order.as_slice();
>> +                        self.fetch_remotes(&remotes, to_fetch).await;
>> +                    }
>> +                }
>> +
>> +                val = self.control_message_rx.recv() => {
>> +                    // Reload settings in case they have changed in the meanwhile
>> +                    self.settings = Self::get_settings_or_default();
>> +                    match val {
> 
> ...but I think this part should be factored out into a
> `self.handle_control_message(msg).await`
> 
> It gets indented quite deeply and it just makes sense IMO :)

Same :)

> 
>> +                        Some(ControlMsg::CollectSingleRemote(remote)) => {
>> +                            if let Some(remotes) = Self::load_remote_config() {
> 
> ^ Also this...
> 
>> +                                log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
>> +                                self.fetch_remotes(&remotes, &[remote]).await;
>> +                            }
>> +                        }
>> +                        Some(ControlMsg::CollectAllRemotes) => {
>> +                            if let Some(remotes) = Self::load_remote_config() {
> 
> ... is the same as this, so if this is factored out into a separate
> function, it could just early-out via a let-else there, as we don't need
> to call it when `val` is `None`, and then it's quite compact.

Ack - refactored it in a separate new commit at the end of the v2 series;
it wasn't worth the effort to cleanly fold this into this commit.

> 
>> +                                log::debug!("starting metric collection from all remotes - triggered by control message");
>> +                                self.fetch_remotes(&remotes, &remotes.order).await;
>> +                            }
>> +                        }
>> +                        _ => {},
>> +                    }
>> +                }
>> +            }
>> +
>> +            let interval = self.settings.collection_interval_or_default();
>> +
>> +            if old_settings.collection_interval_or_default() != interval {
>> +                log::info!(
>> +                    "metric collection interval changed to {} seconds, reloading timer",
>> +                    interval
>> +                );
>> +                timer = Self::setup_timer(interval);
>> +            }
>> +        }
>> +    }
>> +
>> +    async fn sleep_for_random_interval_offset(&self) {
>> +        let mut min = self.settings.min_interval_offset_or_default();
>> +        let max = self.settings.max_interval_offset_or_default();
>> +
>> +        if min > max {
>> +            log::warn!(
>> +                "min-interval-offset is larger than max-interval-offset ({min} > {max}) - \
>> +                capping it to max-interval-offset ({max})"
>> +            );
>> +            min = max;
>> +        }
>> +
>> +        let jitter = {
>> +            let mut rng = rand::thread_rng();
>> +            rng.gen_range(min..=max)
>> +        };
>> +        tokio::time::sleep(Duration::from_secs(jitter)).await;
>> +    }
>> +
> 
> ↑ and ↓ are crying out for a common fn with the variable name, min and
> max as parameters ;-)

I think the main reason that deterred me from doing that was that one function
sleeps for milliseconds and the other for seconds. But this was also before I added
the 'min' capping logic with the log message, which was only added shortly before sending
the patches; before that, deduplicating the code wasn't really worth it. I've changed it in v2, where we now have a
    async fn sleep_for_random_millis(param_base: &str, mut min: u64, max: u64)
function with the call sites adapted.
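For reference, the shared capping logic could look roughly like this (a dependency-free sketch; the helper name follows the v2 naming above, but the clock-based entropy source is a stand-in for `rand::thread_rng()`, and the real function sleeps instead of returning the value):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of the shared jitter helper: cap an inconsistent `min` to `max`
// (warning with the offending parameter names, built from `param_base`)
// and then pick a value in `min..=max`.
fn random_jitter_millis(param_base: &str, mut min: u64, max: u64) -> u64 {
    if min > max {
        eprintln!(
            "min-{param_base} is larger than max-{param_base} ({min} > {max}) - \
             capping it to max-{param_base} ({max})"
        );
        min = max;
    }
    // Cheap entropy stand-in so this sketch stays free of the `rand` crate.
    let entropy = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before UNIX epoch")
        .subsec_nanos() as u64;
    min + entropy % (max - min + 1)
}
```

With `min > max` the helper warns once and effectively degenerates to a fixed delay of `max`, which matches the behavior of the two original functions.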

Thanks!

> 
>> +    async fn sleep_for_random_connection_delay(settings: &CollectionSettings) {
>> +        let mut min = settings.min_connection_delay_or_default();
>> +        let max = settings.max_connection_delay_or_default();
>> +
>> +        if min > max {
>> +            log::warn!(
>> +                "min-collection-delay is larger than max-collection-delay ({min} > {max}) - \
>> +                capping it to max-collection-delay ({max})"
>> +            );
>> +            min = max;
>> +        }
>> +
>> +        let jitter = {
>> +            let mut rng = rand::thread_rng();
>> +            rng.gen_range(min..=max)
>> +        };
>> +
>> +        tokio::time::sleep(Duration::from_millis(jitter)).await;
>> +    }
>> +
>> +    fn get_settings_or_default() -> CollectionSettings {
>> +        // This function is a bit odd, but not sure if there is a much nicer
>> +        // way to do it. We want to fall back to defaults if
>> +        //   - the config file does not exist (no errors logged)
>> +        //   - if section type is wrong or the config failed to parse (log errors)
> 
> It gets a bit shorter if the `_impl` returns a result+option.
> The match below can just `return settings` in the first case, and have
> an `Ok(None) => (),`, the `Err` only has the log line and the default is
> moved to after the match.
> 
> Alternatively a `default()` helper could also shorten it
> (`Ok(default())` in the `_impl` and
> `get_settings_impl().unwrap_or_else(|err| { log(err); default() })` is
> also shorter :-)

Went for something like this for v2:

        get_settings_impl().unwrap_or_else(|err| {
            log::error!("... {err} ");
            CollectionSettings::new("default")
        })
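The fallback behavior being discussed - silent defaults for a missing config file, logged defaults for everything else - can be sketched like this (the `Settings` type, paths, and the elided parsing are hypothetical placeholders for the real `CollectionSettings` config handling):

```rust
use std::fs;
use std::io::ErrorKind;

#[derive(Debug, PartialEq)]
struct Settings {
    interval: u64,
}

impl Settings {
    fn new_default() -> Self {
        Settings { interval: 600 }
    }
}

// Returns Ok(None) when the config file does not exist (no error logged),
// Err(..) on read/parse failures, and Ok(Some(..)) otherwise.
fn get_settings_impl(path: &str) -> Result<Option<Settings>, std::io::Error> {
    match fs::read_to_string(path) {
        Ok(_raw) => Ok(Some(Settings { interval: 300 })), // parsing elided
        Err(err) if err.kind() == ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}

fn get_settings_or_default(path: &str) -> Settings {
    match get_settings_impl(path) {
        Ok(Some(settings)) => settings,
        Ok(None) => Settings::new_default(),
        Err(err) => {
            eprintln!("failed to load metric collection settings: {err} - using defaults");
            Settings::new_default()
        }
    }
}
```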
 

>> +
>> +    /// Fetch metric data from a provided list of remotes concurrently.
>> +    /// The maximum number of concurrent connections is determined by
>> +    /// `max_concurrent_connections` in the [`CollectionSettings`]
>> +    /// instance in `self`.
>> +    async fn fetch_remotes(
>> +        &mut self,
>> +        config: &SectionConfigData<Remote>,
>> +        remotes_to_fetch: &[String],
>> +    ) {
>> +        let semaphore = Arc::new(Semaphore::new(
>> +            self.settings.max_concurrent_connections_or_default(),
>> +        ));
>> +        let mut handles = Vec::new();
> 
> Not sure how "robust" the spawning is wrt. ordering - and this doesn't
> need to happen in this patch (unless it's easy), but a follow-up could
> probably change this into a tokio `JoinSet`, that way the await-loop
> below can use `.join_next()`, so that for example if an early task
> happens to end up finishing last, the loop still collects the remaining
> logs in a more "temporally accurate" way...

Good point, didn't know about `JoinSet`.

I'll do that in a followup or as a separate commit in v2. 
That requires a couple of changes in how we do logging/error handling, since
we can no longer keep the remotes' names in the handle Vec if we switch to a JoinSet.
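The underlying idea - have each task return its remote's name alongside the result, so completion-order collection still knows which remote a log line belongs to - can be sketched with std threads and a channel (tokio's `JoinSet` with `join_next()` works analogously; the metric fetch itself is stubbed out here):

```rust
use std::sync::mpsc;
use std::thread;

// Spawn one worker per remote; each sends back (name, result), so results
// can be consumed in completion order without a spawn-order Vec of names.
fn fetch_all(remotes: Vec<String>) -> Vec<(String, Result<(), String>)> {
    let (tx, rx) = mpsc::channel();
    for remote in remotes {
        let tx = tx.clone();
        thread::spawn(move || {
            let result = Ok(()); // stand-in for the actual metric fetch
            let _ = tx.send((remote, result));
        });
    }
    drop(tx); // close our sender so the receiving iterator terminates
    rx.into_iter().collect()
}
```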

> 
>> +
>> +        for remote_name in remotes_to_fetch {
>> +            let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
>> +
>> +            // unwrap is okay here, acquire_* will only fail if `close` has been
>> +            // called on the semaphore.
>> +            let permit = semaphore.clone().acquire_owned().await.unwrap();
> 
> ^ Nit: where possible, I prefer `Arc::clone(&ptr)` just for the sake of
> not accidentally `.clone()`ing something large after later a refactor
> where the clones aren't in the patch context lines.

Thx, I also usually prefer Arc::clone, missed it this time. Fixed for v2.


>>  
>>      tokio::spawn(async move {
>> -        let task_scheduler = pin!(metric_collection_task(tx));
>> +        let metric_collection_task_future = pin!(async move {
>> +            let mut task =
>> +                MetricCollectionTask::new(metric_data_tx, trigger_collection_rx).unwrap();
>> +            task.run().await;
> 
> ^ nit: could drop the let binding (then it IMO linewraps a bit nicer,
> too)

When I was attempting to fix this, I spotted the `.unwrap()` I had left in there and
decided to change it into

            match MetricCollectionTask::new(metric_data_tx, trigger_collection_rx) {
                Ok(mut task) => task.run().await,
                Err(err) => log::error!("could not start metric collection task: {err}"),
            }

in v2 to get rid of the unwrap, adding a bit more context to any potential error.


>> +/// Schedule metric collection for a given remote as soon as possible.
>> +///
>> +/// Has no effect if the tx end of the channel has not been initialized yet.
>> +/// Returns an error if the mpsc channel has been closed already.
>> +pub async fn trigger_metric_collection_for_remote(remote: &str) -> Result<(), Error> {
>> +    if let Some(sender) = CONTROL_MESSAGE_TX.get() {
> 
> ^ nit: This is the normal case (given that `CONTROL_MESSAGE_TX` gets
> initialized in `start_task()` on startup), so IMO the function signature
> could already be `String` to allow the caller to move it (or `impl
> Into<String>` for this to be implicit) - both callers could currently
> move it ;-)
> 

thx, fixed for v2 :)


-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-13  8:55   ` Wolfgang Bumiller
@ 2025-02-13 13:50     ` Lukas Wagner
  2025-02-13 14:19       ` Wolfgang Bumiller
  0 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 13:50 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 09:55, Wolfgang Bumiller wrote:
>>          loop {
>>              let old_settings = self.settings.clone();
>>              tokio::select! {
>> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
>>                      "metric collection interval changed to {} seconds, reloading timer",
>>                      interval
>>                  );
>> -                timer = Self::setup_timer(interval);
>> +                (timer, next_run) = Self::setup_timer(interval);
>> +                // If we change (and therefore reset) our timer right before it fires,
>> +                // we could potentially miss one collection event.
> 
> Couldn't we instead just pass `next_run` through to `setup_timer` and
> call `reset_at(next_run)` on it? (`first_run` would only be used in the
> initial setup, so `next_run` could either be an `Option`, or the setup
> code does the `next_aligned_instant` call...
> 
> This should be much less code by making the new
> `fetch_overdue{,_and_save_sate}()` functions unnecessary, or am I
> missing something?
> 

I guess the question is, do we want nicely aligned timer ticks?

e.g. 14:01:00, 14:02:00, 14:03:00 ... for 60 second interval
or   14:00:00, 14:05:00, 14:10:00 ... for a 5 minute interval?

Because that was the main intention behind using the 'collection-interval' as
a base for calculating the aligned instant for the first timer reset.
If we reuse the 'old' `next_run` when the interval is changed, we
also reuse the old alignment. 

For instance, when changing from initially 1 minute to 5 minutes, the
timer ticks might come at 
  14:01:00, 14:06:00, 14:11:00

Technically, the naming for the `next_run` variable is not the best,
since it just contains the Instant when the timer *first* fires, but
this is then never updated to the *next* time the timer will fire...
So that means that when changing the interval with your suggested change,
you'd pass an Instant to `reset_at` that is already in the past,
causing the timer to fire immediately.

If we *don't* care about the aligned ticks as described above, we could
just use a static alignment boundary, e.g. 60 seconds.
In this case we can also get rid of the fetch_overdue stuff, since
at worst case we have 60 seconds until the next tick on startup or timer change,
which should be good enough to prevent any significant gaps in the data.
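The alignment in question boils down to rounding up to the next multiple of the interval; a minimal sketch with plain seconds-since-epoch values instead of `Instant` (the helper name follows the `next_aligned_instant` mentioned earlier in the thread):

```rust
// Next tick aligned to `interval`: with interval = 300 the ticks land on
// :00/:05/:10, with interval = 60 on full minutes, and so on.
fn next_aligned_instant(now: u64, interval: u64) -> u64 {
    (now / interval + 1) * interval
}
```

This is also why reusing the old `next_run` after an interval change loses the alignment: the old value was computed against the old interval's grid.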


-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods
  2025-02-13 12:10   ` Wolfgang Bumiller
@ 2025-02-13 13:52     ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 13:52 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel

On  2025-02-13 13:10, Wolfgang Bumiller wrote:
>> +    /// Trigger metric collection for a single remote or all, if no remote is provided.
>> +    pub async fn trigger_metric_collection(
>> +        &self,
>> +        remote: Option<&str>,
>> +    ) -> Result<(), proxmox_client::Error> {
>> +        let path = "/api2/extjs/metric-collection/trigger";
>> +
>> +        #[derive(Serialize)]
>> +        struct TriggerParams<'a> {
>> +            #[serde(skip_serializing_if = "Option::is_none")]
>> +            remote: Option<&'a str>,
>> +        }
>> +
>> +        self.0
>> +            .post(path, &TriggerParams { remote })
>> +            .await?
>> +            .expect_json::<()>()?;
> 
> Look like this should be `.nodata()`?
> 

Thx, fixed for v2.



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-13 12:14   ` Wolfgang Bumiller
@ 2025-02-13 14:17     ` Lukas Wagner
  2025-02-13 14:56       ` Wolfgang Bumiller
  0 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 14:17 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 13:14, Wolfgang Bumiller wrote:
> On Tue, Feb 11, 2025 at 01:05:41PM +0100, Lukas Wagner wrote:
>> This adds new commands to the proxmox-datacenter-client CLI tool, namely
>>  - to update metric collection settings
>>  - trigger metric collection
>>  - show status of the last metric collection
>>
>> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
>> ---
>>  cli/client/Cargo.toml               |   1 +
>>  cli/client/src/main.rs              |   2 +
>>  cli/client/src/metric_collection.rs | 164 ++++++++++++++++++++++++++++
>>  3 files changed, 167 insertions(+)
>>  create mode 100644 cli/client/src/metric_collection.rs
>>
>> diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
>> index 96016cd..5c96b28 100644
>> --- a/cli/client/Cargo.toml
>> +++ b/cli/client/Cargo.toml
>> @@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
>>  proxmox-schema = { workspace = true, features = [ "api-macro" ] }
>>  proxmox-section-config.workspace = true
>>  proxmox-sys.workspace = true
>> +proxmox-time.workspace = true
>>  proxmox-tfa = { workspace = true, features = [ "types" ] }
>>  
>>  # for the pve API types:
>> diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
>> index c913e97..55877cf 100644
>> --- a/cli/client/src/main.rs
>> +++ b/cli/client/src/main.rs
>> @@ -17,6 +17,7 @@ pub mod env;
>>  
>>  pub mod acl;
>>  pub mod config;
>> +pub mod metric_collection;
>>  pub mod pbs;
>>  pub mod pve;
>>  pub mod remotes;
>> @@ -93,6 +94,7 @@ fn main_do() -> Result<(), Error> {
>>          )
>>          .insert("acl", acl::cli())
>>          .insert("login", CliCommand::new(&API_METHOD_LOGIN))
>> +        .insert("metric-collection", metric_collection::cli())
> 
> maaaaaybe...
>     .alias(&["mc"], &["metric-collection"])
> ? 🤪
> 

Good idea, didn't know about 'alias'. :)

I couldn't really get it to work after experimenting with it for a bit, though, so I might
save that for a followup.

>>          .insert("pbs", pbs::cli())
>>          .insert("pve", pve::cli())
>>          .insert("remote", remotes::cli())
>> diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
>> new file mode 100644
>> index 0000000..8ca861a
>> --- /dev/null
>> +++ b/cli/client/src/metric_collection.rs
>> @@ -0,0 +1,164 @@
>> +use anyhow::Error;
>> +use pdm_api_types::{
>> +    remotes::REMOTE_ID_SCHEMA, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
>> +};
>> +use proxmox_router::cli::{
>> +    format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
>> +};
>> +use proxmox_schema::api;
>> +
>> +use crate::{client, env};
>> +
>> +pub fn cli() -> CommandLineInterface {
>> +    CliCommandMap::new()
>> +        .insert(
>> +            "show-settings",
>> +            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
>> +        )
> 
> ↑↓ wouldn't `settings show`, `settings update` be more consistent?
> 
>> +        .insert(
>> +            "update-settings",
>> +            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
>> +        )

I see your point, but a 'show' next to a 'status' also felt a bit weird; that
was the reason I added the suffix, to make it clearer. I can change it back
if you prefer.

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-13 13:50     ` Lukas Wagner
@ 2025-02-13 14:19       ` Wolfgang Bumiller
  2025-02-13 15:21         ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 14:19 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Thu, Feb 13, 2025 at 02:50:32PM +0100, Lukas Wagner wrote:
> 
> 
> On  2025-02-13 09:55, Wolfgang Bumiller wrote:
> >>          loop {
> >>              let old_settings = self.settings.clone();
> >>              tokio::select! {
> >> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
> >>                      "metric collection interval changed to {} seconds, reloading timer",
> >>                      interval
> >>                  );
> >> -                timer = Self::setup_timer(interval);
> >> +                (timer, next_run) = Self::setup_timer(interval);
> >> +                // If change (and therefore reset) our timer right before it fires,
> >> +                // we could potentially miss one collection event.
> > 
> > Couldn't we instead just pass `next_run` through to `setup_timer` and
> > call `reset_at(next_run)` on it? (`first_run` would only be used in the
> > initial setup, so `next_run` could either be an `Option`, or the setup
> > code does the `next_aligned_instant` call...
> > 
> > This should be much less code by making the new
> > `fetch_overdue{,_and_save_sate}()` functions unnecessary, or am I
> > missing something?
> > 
> 
> I guess the question is, do we want nicely aligned timer ticks?
> 
> e.g. 14:01:00, 14:02:00, 14:03:00 ... for 60 second interval
> or   14:00:00, 14:05:00, 14:10:00 ... for a 5 minute interval?
> 
> Because that was the main intention behind using the 'collection-interval' as
> a base for calculating the aligned instant for the first timer reset.
> If we reuse the 'old' `next_run` when the interval is changed, we
> also reuse the old alignment. 
> 
> For instance, when changing from initially 1 minute to 5 minutes, the
> timer ticks might come at 
>   14:01:00, 14:06:00, 14:11:00
> 
> Technically, the naming for the `next_run` variable is not the best,
> since it just contains the Instant when the timer *first* fires, but
> this is then never updated to the *next* time the timer will fire...
> So that means that when changing the interval with your suggested change,
> you'd pass an Instant to `reset_at` that is already in the past,
> causing the timer to fire immediately.
> 
> If we *don't* care about the aligned ticks as described above, we could
> just use a static alignment boundary, e.g. 60 seconds.
> In this case we can also get rid of the fetch_overdue stuff, since
> at worst case we have 60 seconds until the next tick on startup or timer change,
> which should be good enough to prevent any significant gaps in the data.

What about setting a flag - if the current next tick was earlier than
the new next tick - to tell tick() to re-align the timer when it is next
triggered?

So when going from 1 to 5 minutes at 14:01:50, we `.reset_at(14:02:00)`
and also set `realign = true`, and at 14:02, tick() should
`.reset_at(14:05:00)`.

I just feel like the logic in the "fetch_overdue" code should not be
necessary to have, but if it's too awkward to handle via the tick timer,
it's fine to keep it in v2.
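The flag-based variant suggested above could be sketched like this (plain integer timestamps instead of tokio's `Interval`/`reset_at`; all names are hypothetical):

```rust
struct CollectionTimer {
    interval: u64,  // seconds
    next_tick: u64, // seconds since the epoch
    realign: bool,
}

impl CollectionTimer {
    // Keep the already-scheduled tick so it cannot be missed, but remember
    // to re-align to the new interval's grid once that tick has fired.
    fn change_interval(&mut self, new_interval: u64) {
        self.interval = new_interval;
        self.realign = true;
    }

    fn on_tick(&mut self, now: u64) {
        if self.realign {
            // Equivalent of reset_at(next_aligned_instant(..)).
            self.next_tick = (now / self.interval + 1) * self.interval;
            self.realign = false;
        } else {
            self.next_tick += self.interval;
        }
    }
}
```

Going from 1 to 5 minutes at 14:01:50, the 14:02:00 tick still fires, and the tick after it lands on the 14:05:00 boundary.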



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-13 14:17     ` Lukas Wagner
@ 2025-02-13 14:56       ` Wolfgang Bumiller
  2025-02-13 14:58         ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 14:56 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Thu, Feb 13, 2025 at 03:17:03PM +0100, Lukas Wagner wrote:
> 
> 
> On  2025-02-13 13:14, Wolfgang Bumiller wrote:
> > On Tue, Feb 11, 2025 at 01:05:41PM +0100, Lukas Wagner wrote:
> >> This adds new commands to the proxmox-datacenter-client CLI tool, namely
> >>  - to update metric collection settings
> >>  - trigger metric collection
> >>  - show status of the last metric collection
> >>
> >> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> >> ---
> >>  cli/client/Cargo.toml               |   1 +
> >>  cli/client/src/main.rs              |   2 +
> >>  cli/client/src/metric_collection.rs | 164 ++++++++++++++++++++++++++++
> >>  3 files changed, 167 insertions(+)
> >>  create mode 100644 cli/client/src/metric_collection.rs
> >>
> >> diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
> >> index 96016cd..5c96b28 100644
> >> --- a/cli/client/Cargo.toml
> >> +++ b/cli/client/Cargo.toml
> >> @@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
> >>  proxmox-schema = { workspace = true, features = [ "api-macro" ] }
> >>  proxmox-section-config.workspace = true
> >>  proxmox-sys.workspace = true
> >> +proxmox-time.workspace = true
> >>  proxmox-tfa = { workspace = true, features = [ "types" ] }
> >>  
> >>  # for the pve API types:
> >> diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
> >> index c913e97..55877cf 100644
> >> --- a/cli/client/src/main.rs
> >> +++ b/cli/client/src/main.rs
> >> @@ -17,6 +17,7 @@ pub mod env;
> >>  
> >>  pub mod acl;
> >>  pub mod config;
> >> +pub mod metric_collection;
> >>  pub mod pbs;
> >>  pub mod pve;
> >>  pub mod remotes;
> >> @@ -93,6 +94,7 @@ fn main_do() -> Result<(), Error> {
> >>          )
> >>          .insert("acl", acl::cli())
> >>          .insert("login", CliCommand::new(&API_METHOD_LOGIN))
> >> +        .insert("metric-collection", metric_collection::cli())
> > 
> > maaaaaybe...
> >     .alias(&["mc"], &["metric-collection"])
> > ? 🤪
> > 
> 
> Good idea, didn't know about 'alias'. :)
> 
> Although I really couldn't make it work after trying around for a bit, I might save that for a
> followup.

Okay. PDM uses a newer CLI parsing logic which might be missing this...

> 
> >>          .insert("pbs", pbs::cli())
> >>          .insert("pve", pve::cli())
> >>          .insert("remote", remotes::cli())
> >> diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
> >> new file mode 100644
> >> index 0000000..8ca861a
> >> --- /dev/null
> >> +++ b/cli/client/src/metric_collection.rs
> >> @@ -0,0 +1,164 @@
> >> +use anyhow::Error;
> >> +use pdm_api_types::{
> >> +    remotes::REMOTE_ID_SCHEMA, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
> >> +};
> >> +use proxmox_router::cli::{
> >> +    format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
> >> +};
> >> +use proxmox_schema::api;
> >> +
> >> +use crate::{client, env};
> >> +
> >> +pub fn cli() -> CommandLineInterface {
> >> +    CliCommandMap::new()
> >> +        .insert(
> >> +            "show-settings",
> >> +            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
> >> +        )
> > 
> > ↑↓ wouldn't `settings show`, `settings update` be more consistent?
> > 
> >> +        .insert(
> >> +            "update-settings",
> >> +            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
> >> +        )
> 
> I see your point, but then a 'show' next to a 'status' also felt a bit weird, that
> was the reason why I added the suffix to make it more clear. I can change it back
> if you prefer.

Well, it's a "settings show", not a "show settings", next to the
"status", as in, the show is on a different level, or what do you mean?


_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-13 14:56       ` Wolfgang Bumiller
@ 2025-02-13 14:58         ` Lukas Wagner
  2025-02-13 15:11           ` Lukas Wagner
  0 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 14:58 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 15:56, Wolfgang Bumiller wrote:
>>>> +
>>>> +pub fn cli() -> CommandLineInterface {
>>>> +    CliCommandMap::new()
>>>> +        .insert(
>>>> +            "show-settings",
>>>> +            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
>>>> +        )
>>>
>>> ↑↓ wouldn't `settings show`, `settings update` be more consistent?
>>>
>>>> +        .insert(
>>>> +            "update-settings",
>>>> +            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
>>>> +        )
>>
>> I see your point, but then a 'show' next to a 'status' also felt a bit weird, that
>> was the reason why I added the suffix to make it more clear. I can change it back
>> if you prefer.
> 
> Well, it's a "settings show", not a "show settings", next to the
> "status", as in, the show is on a different level, or what do you mean?

Oh, right, that's better. I'll do it that way then.

Thx!

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status
  2025-02-13 14:58         ` Lukas Wagner
@ 2025-02-13 15:11           ` Lukas Wagner
  0 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 15:11 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 15:58, Lukas Wagner wrote:
> On  2025-02-13 15:56, Wolfgang Bumiller wrote:
>>>>> +
>>>>> +pub fn cli() -> CommandLineInterface {
>>>>> +    CliCommandMap::new()
>>>>> +        .insert(
>>>>> +            "show-settings",
>>>>> +            CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
>>>>> +        )
>>>>
>>>> ↑↓ wouldn't `settings show`, `settings update` be more consistent?
>>>>
>>>>> +        .insert(
>>>>> +            "update-settings",
>>>>> +            CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
>>>>> +        )
>>>
>>> I see your point, but then a 'show' next to a 'status' also felt a bit weird, that
>>> was the reason why I added the suffix to make it more clear. I can change it back
>>> if you prefer.
>>
>> Well, it's a "settings show", not a "show settings", next to the
>> "status", as in, the show is on a different level, or what do you mean?
> 
> Oh, right, that's better. I'll do it that way then.
> 

To avoid any confusion, originally I just had
 proxmox-datacenter-client metric-collection status
 proxmox-datacenter-client metric-collection show
 proxmox-datacenter-client metric-collection update

and I missed the 'settings' prefix in your suggestion, so I thought you
were suggesting the same thing I already had in an earlier version.

So what I will do in v2 now is:

 proxmox-datacenter-client metric-collection status
 proxmox-datacenter-client metric-collection settings show
 proxmox-datacenter-client metric-collection settings update

Sorry for the confusion and noise.

-- 
- Lukas





* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-13 14:19       ` Wolfgang Bumiller
@ 2025-02-13 15:21         ` Lukas Wagner
  2025-02-13 15:34           ` Wolfgang Bumiller
  0 siblings, 1 reply; 49+ messages in thread
From: Lukas Wagner @ 2025-02-13 15:21 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pdm-devel



On  2025-02-13 15:19, Wolfgang Bumiller wrote:
> On Thu, Feb 13, 2025 at 02:50:32PM +0100, Lukas Wagner wrote:
>>
>>
>> On  2025-02-13 09:55, Wolfgang Bumiller wrote:
>>>>          loop {
>>>>              let old_settings = self.settings.clone();
>>>>              tokio::select! {
>>>> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
>>>>                      "metric collection interval changed to {} seconds, reloading timer",
>>>>                      interval
>>>>                  );
>>>> -                timer = Self::setup_timer(interval);
>>>> +                (timer, next_run) = Self::setup_timer(interval);
>>>> +                // If we change (and therefore reset) our timer right before it fires,
>>>> +                // we could potentially miss one collection event.
>>>
>>> Couldn't we instead just pass `next_run` through to `setup_timer` and
>>> call `reset_at(next_run)` on it? (`first_run` would only be used in the
>>> initial setup, so `next_run` could either be an `Option`, or the setup
>>> code does the `next_aligned_instant` call...
>>>
>>> This should be much less code by making the new
>>> `fetch_overdue{,_and_save_state}()` functions unnecessary, or am I
>>> missing something?
>>>
>>
>> I guess the question is, do we want nicely aligned timer ticks?
>>
>> e.g. 14:01:00, 14:02:00, 14:03:00 ... for 60 second interval
>> or   14:00:00, 14:05:00, 14:10:00 ... for a 5 minute interval?
>>
>> Because that was the main intention behind using the 'collection-interval' as
>> a base for calculating the aligned instant for the first timer reset.
>> If we reuse the 'old' `next_run` when the interval is changed, we
>> also reuse the old alignment. 
>>
>> For instance, when changing from initially 1 minute to 5 minutes, the
>> timer ticks might come at 
>>   14:01:00, 14:06:00, 14:11:00
>>
>> Technically, the naming for the `next_run` variable is not the best,
>> since it just contains the Instant when the timer *first* fires, but
>> this is then never updated to the *next* time the timer will fire...
>> So that means that when changing the interval with your suggested change,
>> you'd pass an Instant to `reset_at` that is already in the past,
>> causing the timer to fire immediately.
>>
>> If we *don't* care about the aligned ticks as described above, we could
>> just use a static alignment boundary, e.g. 60 seconds.
>> In this case we can also get rid of the fetch_overdue stuff, since
>> at worst case we have 60 seconds until the next tick on startup or timer change,
>> which should be good enough to prevent any significant gaps in the data.
> 
> What about setting a flag - if the current next tick was earlier than
> the new next tick - to tell tick() to re-align the timer when it is next
> triggered?

The problem is that tokio::time::Interval doesn't give you a way to query when
the next expected tick will be. We can only approximate it by recalculating
a new aligned instant with the same interval, but I guess this might behave 
unpredictably in edge cases.
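
As a rough illustration of the aligned-instant calculation discussed here
(the function name and epoch-seconds representation are illustrative, not
the patch's actual helper), rounding the current time up to the next strict
multiple of the interval gives the boundary the timer should fire on:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Round `now` (seconds since the epoch) up to the next strict multiple
/// of `interval` -- the "aligned instant" for the first timer tick.
fn next_aligned(now: u64, interval: u64) -> u64 {
    (now / interval + 1) * interval
}

fn main() {
    // 60 s interval: from t=90 the next minute boundary is t=120.
    assert_eq!(next_aligned(90, 60), 120);
    // Switching to a 300 s interval re-aligns to the 5-minute grid
    // instead of keeping the old 60 s alignment.
    assert_eq!(next_aligned(110, 300), 300);
    // Exactly on a boundary, the *next* boundary is returned.
    assert_eq!(next_aligned(600, 300), 900);

    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs();
    println!("next 60 s boundary: {}", next_aligned(now, 60));
}
```

Reusing the old `next_run` after an interval change would keep the old grid
(14:01, 14:06, ...); recomputing with the new interval restores alignment.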

> 
> So when going from 1 to 5 minutes at 14:01:50, we `.reset_at(14:02:00)`
> and also set `realign = true`, and at 14:02, tick() should
> `.reset_at(14:05:00)`.
> 
> I just feel like the logic in the "fetch_overdue" code should not be
> necessary to have, but if it's too awkward to handle via the tick timer,
> it's fine to keep it in v2.

fetch_overdue is also called when the daemon starts up. If we align to the collection interval
and the daemon was down for a while, we otherwise might end up with gaps in the data.

Remotes keep metric history for 30 minutes. If PDM is down for, say, 29 minutes and
we are aligning to 15min boundaries, in the worst case we might have to wait for
another 15min to fetch metrics, resulting in a gap.

Of course we could just unconditionally force collection after startup, but I think the
fetch_overdue solution handles both this and the timer-change issue reasonably well.
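
The overdue check this relies on can be sketched as a pure comparison of the
last collection timestamp against the interval (hypothetical names and
signature, not the patch's actual code):

```rust
/// A remote counts as overdue if it was never collected, or if its last
/// collection lies at least one full interval in the past.
fn is_overdue(last_collection: Option<u64>, interval: u64, now: u64) -> bool {
    match last_collection {
        None => true,
        Some(t) => now.saturating_sub(t) >= interval,
    }
}

fn main() {
    // Never collected: fetch right away on startup.
    assert!(is_overdue(None, 300, 1_740));
    // Last collection 29 min ago with a 15 min interval: overdue,
    // so fetching now avoids a gap (remotes keep ~30 min of history).
    assert!(is_overdue(Some(0), 900, 1_740));
    // Collected 40 s ago with a 5 min interval: wait for the next tick.
    assert!(!is_overdue(Some(1_700), 300, 1_740));
}
```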

In v2 I got rid of the fetch_overdue_and_save_state wrapper by putting the state.save()
that we already had in the main loop at the end of the loop, so it's a bit less code now.
The remaining code is not really that complex, I think I'd prefer to keep it
for now.

-- 
- Lukas




* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change
  2025-02-13 15:21         ` Lukas Wagner
@ 2025-02-13 15:34           ` Wolfgang Bumiller
  0 siblings, 0 replies; 49+ messages in thread
From: Wolfgang Bumiller @ 2025-02-13 15:34 UTC (permalink / raw)
  To: Lukas Wagner; +Cc: pdm-devel

On Thu, Feb 13, 2025 at 04:21:33PM +0100, Lukas Wagner wrote:
> 
> 
> On  2025-02-13 15:19, Wolfgang Bumiller wrote:
> > On Thu, Feb 13, 2025 at 02:50:32PM +0100, Lukas Wagner wrote:
> >>
> >>
> >> On  2025-02-13 09:55, Wolfgang Bumiller wrote:
> >>>>          loop {
> >>>>              let old_settings = self.settings.clone();
> >>>>              tokio::select! {
> >>>> @@ -124,7 +132,12 @@ impl MetricCollectionTask {
> >>>>                      "metric collection interval changed to {} seconds, reloading timer",
> >>>>                      interval
> >>>>                  );
> >>>> -                timer = Self::setup_timer(interval);
> >>>> +                (timer, next_run) = Self::setup_timer(interval);
> >>>> +                // If we change (and therefore reset) our timer right before it fires,
> >>>> +                // we could potentially miss one collection event.
> >>>
> >>> Couldn't we instead just pass `next_run` through to `setup_timer` and
> >>> call `reset_at(next_run)` on it? (`first_run` would only be used in the
> >>> initial setup, so `next_run` could either be an `Option`, or the setup
> >>> code does the `next_aligned_instant` call...
> >>>
> >>> This should be much less code by making the new
> >>> `fetch_overdue{,_and_save_state}()` functions unnecessary, or am I
> >>> missing something?
> >>>
> >>
> >> I guess the question is, do we want nicely aligned timer ticks?
> >>
> >> e.g. 14:01:00, 14:02:00, 14:03:00 ... for 60 second interval
> >> or   14:00:00, 14:05:00, 14:10:00 ... for a 5 minute interval?
> >>
> >> Because that was the main intention behind using the 'collection-interval' as
> >> a base for calculating the aligned instant for the first timer reset.
> >> If we reuse the 'old' `next_run` when the interval is changed, we
> >> also reuse the old alignment. 
> >>
> >> For instance, when changing from initially 1 minute to 5 minutes, the
> >> timer ticks might come at 
> >>   14:01:00, 14:06:00, 14:11:00
> >>
> >> Technically, the naming for the `next_run` variable is not the best,
> >> since it just contains the Instant when the timer *first* fires, but
> >> this is then never updated to the *next* time the timer will fire...
> >> So that means that when changing the interval with your suggested change,
> >> you'd pass an Instant to `reset_at` that is already in the past,
> >> causing the timer to fire immediately.
> >>
> >> If we *don't* care about the aligned ticks as described above, we could
> >> just use a static alignment boundary, e.g. 60 seconds.
> >> In this case we can also get rid of the fetch_overdue stuff, since
> >> at worst case we have 60 seconds until the next tick on startup or timer change,
> >> which should be good enough to prevent any significant gaps in the data.
> > 
> > What about setting a flag - if the current next tick was earlier than
> > the new next tick - to tell tick() to re-align the timer when it is next
> > triggered?
> 
> The problem is that tokio::time::Interval doesn't give you a way to query when
> the next expected tick will be. We can only approximate it by recalculating
> a new aligned instant with the same interval, but I guess this might behave 
> unpredictably in edge cases.
> 
> > 
> > So when going from 1 to 5 minutes at 14:01:50, we `.reset_at(14:02:00)`
> > and also set `realign = true`, and at 14:02, tick() should
> > `.reset_at(14:05:00)`.
> > 
> > I just feel like the logic in the "fetch_overdue" code should not be
> > necessary to have, but if it's too awkward to handle via the tick timer,
> > it's fine to keep it in v2.
> 
> fetch_overdue is also called when the daemon starts up. If we align to the collection interval
> and the daemon was down for a while, we otherwise might end up with gaps in the data.
> 
> Remotes keep metric history for 30 minutes. If PDM is down for, say, 29 minutes and
> we are aligning to 15min boundaries, in the worst case we might have to wait for
> another 15min to fetch metrics, resulting in a gap.
> 
> Of course we could just unconditionally force collection after startup, but I think the
> fetch_overdue solution solves this and the timer change issue quite okayish.
> 
> In v2 I got rid of the fetch_overdue_and_save_state wrapper by putting the state.save()
> that we already had in the main loop at the end of the loop, so it's a bit less code now.
> The remaining code is not really that complex, I think I'd prefer to keep it
> for now.

Okay.



* Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI)
  2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
                   ` (24 preceding siblings ...)
  2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
@ 2025-02-14 13:08 ` Lukas Wagner
  25 siblings, 0 replies; 49+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:08 UTC (permalink / raw)
  To: pdm-devel

Superseded-by: https://lore.proxmox.com/pdm-devel/20250214130653.283012-1-l.wagner@proxmox.com/T/#t

-- 
- Lukas




end of thread, other threads:[~2025-02-14 13:09 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-11 12:05 [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 01/25] test support: add NamedTempFile helper Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 02/25] test support: add NamedTempDir helper Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 03/25] pdm-api-types: add CollectionSettings type Lukas Wagner
2025-02-11 14:18   ` Maximiliano Sandoval
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 04/25] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 05/25] metric collection: split top_entities split into separate module Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 06/25] metric collection: save metric data to RRD in separate task Lukas Wagner
2025-02-12 13:59   ` Wolfgang Bumiller
2025-02-12 14:32     ` Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 07/25] metric collection: rework metric poll task Lukas Wagner
2025-02-11 12:58   ` Lukas Wagner
2025-02-12 15:57   ` Wolfgang Bumiller
2025-02-13 12:31     ` Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 08/25] metric collection: persist state after metric collection Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 09/25] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 10/25] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
2025-02-13  8:55   ` Wolfgang Bumiller
2025-02-13 13:50     ` Lukas Wagner
2025-02-13 14:19       ` Wolfgang Bumiller
2025-02-13 15:21         ` Lukas Wagner
2025-02-13 15:34           ` Wolfgang Bumiller
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 11/25] metric collection: add tests for the fetch_remotes function Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 12/25] metric collection: add test for fetch_overdue Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 13/25] metric collection: pass rrd cache instance as function parameter Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 14/25] metric collection: add test for rrd task Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 15/25] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 16/25] metric collection: record remote response time in metric database Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 17/25] metric collection: save time needed for collection run to RRD Lukas Wagner
2025-02-13 11:53   ` Wolfgang Bumiller
2025-02-13 12:12     ` Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 18/25] metric collection: periodically clean removed remotes from statefile Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 19/25] api: add endpoint for updating metric collection settings Lukas Wagner
2025-02-13 12:09   ` Wolfgang Bumiller
2025-02-13 12:15     ` Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 20/25] api: add endpoint to trigger metric collection Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 21/25] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 22/25] api: add api for querying metric collection RRD data Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 23/25] api: metric-collection: add status endpoint Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 24/25] pdm-client: add metric collection API methods Lukas Wagner
2025-02-13 12:10   ` Wolfgang Bumiller
2025-02-13 13:52     ` Lukas Wagner
2025-02-11 12:05 ` [pdm-devel] [PATCH proxmox-datacenter-manager 25/25] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
2025-02-13 12:14   ` Wolfgang Bumiller
2025-02-13 14:17     ` Lukas Wagner
2025-02-13 14:56       ` Wolfgang Bumiller
2025-02-13 14:58         ` Lukas Wagner
2025-02-13 15:11           ` Lukas Wagner
2025-02-14 13:08 ` [pdm-devel] [PATCH proxmox-datacenter-manager 00/25] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox