* [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI)
@ 2025-02-14 13:06 Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 01/28] test support: add NamedTempFile helper Lukas Wagner
` (29 more replies)
0 siblings, 30 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
Key points:
- fetch metrics concurrently
- configuration for metric collection
- new config /etc/proxmox-datacenter-manager/metric-collection.cfg
- max-concurrency (number of allowed parallel connections)
- collection-interval
- randomized offset for collection start
(min-interval-offset..max-interval-offset)
- randomized per-connection delay
(min-connection-delay..max-connection-delay)
- Add some tests for the core logic in the metric collection system
- Allow triggering metric collection via the API
- Record metric collection statistics in the RRD
- overall collection time for all remotes
- per remote response time when fetching metrics
- Persist metric collection state to disk:
/var/lib/proxmox-datacenter-manager/metric-collection-state.json
(timestamps of last collection, errors)
- Trigger metric collection for any new remotes added via the API
- Add new API endpoints
POST /metric-collection/trigger with optional 'remote' param
GET /metric-collection/status
GET/PUT /config/metric-collection/default
GET /remotes/<remote>/metric-collection-rrddata
GET /metric-collection/rrddata
- Add CLI tooling
proxmox-datacenter-client metric-collection settings show
proxmox-datacenter-client metric-collection settings update
proxmox-datacenter-client metric-collection trigger [--remote <remote>]
proxmox-datacenter-client metric-collection status
## To reviewers / open questions:
- Please review the defaults I've chosen for the settings, especially
the ones for the default metric collection interval (10 minutes) as
well as max-concurrency (10).
I also kindly ask you to double-check the naming of the properties.
See "pdm-api-types: add CollectionSettings type" for details
- Please review path and params for new API endpoints (anything public
facing that is hard to change later)
- I've chosen a section-config config now, even though we only have a
single section for now. This was done for future-proofing reasons,
maybe we want to add support for different setting 'groups' or
something, e.g. to have different settings for distinct sets of
remotes. Does this make sense?
Or should I just stick to a simple config for now? (At moments like
these I wish for TOML configs where we could be a bit more flexible...)
collection-settings: default
max-concurrency 10
collection-interval 180
min-interval-offset 0
max-interval-offset 20
min-connection-delay 10
max-connection-delay 100
- Should `GET /remotes/<remote>/metric-collection-rrddata` be
just `rrddata`?
not sure if we are going to add any other PDM-native per-remote
metrics and whether we want to return that from the same API call
as this...
## Potential future work
- UI button for triggering metric collection
- UI for metric collection settings
- Show RRD graphs for metric collection stats somewhere
## Random offset/delay examples
Example with 'max-concurrency' = 3 and 6 remotes.
X ... timer triggered
[ A ] .... fetching remote 'A'
**** .... interval-offset (usually a couple of seconds)
#### .... random worker delay (usually in millisecond range)
                        /--########[ B ] ###  [ C ]--\
                       /---####[ A ] ######  [ D ]--------\
----X ************* ---/ ---###### [ E ] #########[ F ]--\----
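In code, both delays boil down to the same jitter helper; a simplified
sketch of what the reworked poll task (patch 07/28) does:

    use std::time::Duration;

    use rand::Rng;

    async fn sleep_for_random_millis(min: u64, max: u64) {
        // pick a random delay in [min, max] milliseconds
        let jitter = rand::thread_rng().gen_range(min..=max);
        tokio::time::sleep(Duration::from_millis(jitter)).await;
    }

    // when the interval timer fires (bounds configured in seconds):
    //     sleep_for_random_millis(min_interval_offset * 1000, max_interval_offset * 1000).await;
    // in each per-remote worker, right before connecting:
    //     sleep_for_random_millis(min_connection_delay, max_connection_delay).await;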
Changes since [v1]:
- add missing dependency on librust-rand-dev to d/control
- Fix a couple of minor spelling/punctuation issues (thx maximiliano)
- Some minor code style improvements, e.g. using unwrap_or_else instead of doing
a manual match
- Document return values of 'setup_timer' function
- Factor out handle_tick/handle_control_message
- Minor refactoring/code style improvements
- CLI: Change 'update-settings' to 'settings update'
- CLI: Change 'show-settings' to 'settings show'
- change missed tick behavior for tokio::time::Interval to 'skip' instead
of burst.
The last three commits are new in v2.
[v1]: https://lore.proxmox.com/pdm-devel/20250211120541.163621-1-l.wagner@proxmox.com/T/#t
proxmox-datacenter-manager:
Lukas Wagner (28):
test support: add NamedTempFile helper
test support: add NamedTempDir helper
pdm-api-types: add CollectionSettings type
pdm-config: add functions for reading/writing metric collection
settings
metric collection: split top_entities into separate module
metric collection: save metric data to RRD in separate task
metric collection: rework metric poll task
metric collection: persist state after metric collection
metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
metric collection: collect overdue metrics on startup/timer change
metric collection: add tests for the fetch_remotes function
metric collection: add test for fetch_overdue
metric collection: pass rrd cache instance as function parameter
metric collection: add test for rrd task
metric collection: wrap rrd_cache::Cache in a struct
metric collection: record remote response time in metric database
metric collection: save time needed for collection run to RRD
metric collection: periodically clean removed remotes from statefile
api: add endpoint for updating metric collection settings
api: add endpoint to trigger metric collection
api: remotes: trigger immediate metric collection for newly added
nodes
api: add api for querying metric collection RRD data
api: metric-collection: add status endpoint
pdm-client: add metric collection API methods
cli: add commands for metric-collection settings, trigger, status
metric collection: factor out handle_tick and handle_control_message
fns
metric collection: skip missed timer ticks
metric collection: use JoinSet instead of joining from handles in a
Vec
Cargo.toml | 1 +
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 170 ++++
debian/control | 1 +
lib/pdm-api-types/src/lib.rs | 3 +
lib/pdm-api-types/src/metric_collection.rs | 203 +++++
lib/pdm-api-types/src/rrddata.rs | 26 +
lib/pdm-client/src/lib.rs | 87 +++
lib/pdm-config/src/lib.rs | 1 +
lib/pdm-config/src/metric_collection.rs | 69 ++
server/Cargo.toml | 1 +
server/src/api/config/metric_collection.rs | 156 ++++
server/src/api/config/mod.rs | 2 +
server/src/api/metric_collection.rs | 99 +++
server/src/api/mod.rs | 2 +
server/src/api/remotes.rs | 59 ++
server/src/api/resources.rs | 3 +-
server/src/api/rrd_common.rs | 11 +-
server/src/bin/proxmox-datacenter-api.rs | 2 +-
server/src/lib.rs | 2 +-
.../src/metric_collection/collection_task.rs | 739 ++++++++++++++++++
server/src/metric_collection/mod.rs | 332 ++------
server/src/metric_collection/rrd_cache.rs | 204 ++---
server/src/metric_collection/rrd_task.rs | 289 +++++++
server/src/metric_collection/state.rs | 150 ++++
server/src/metric_collection/top_entities.rs | 150 ++++
server/src/test_support/mod.rs | 4 +
server/src/test_support/temp.rs | 60 ++
29 files changed, 2467 insertions(+), 362 deletions(-)
create mode 100644 cli/client/src/metric_collection.rs
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
create mode 100644 lib/pdm-config/src/metric_collection.rs
create mode 100644 server/src/api/config/metric_collection.rs
create mode 100644 server/src/api/metric_collection.rs
create mode 100644 server/src/metric_collection/collection_task.rs
create mode 100644 server/src/metric_collection/rrd_task.rs
create mode 100644 server/src/metric_collection/state.rs
create mode 100644 server/src/metric_collection/top_entities.rs
create mode 100644 server/src/test_support/temp.rs
Summary over all repositories:
29 files changed, 2467 insertions(+), 362 deletions(-)
--
Generated by git-murpp 0.8.0
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 01/28] test support: add NamedTempFile helper
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 02/28] test support: add NamedTempDir helper Lukas Wagner
` (28 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This one is useful when writing tests; it automatically removes the
temporary file when dropped. The name was chosen because of the similar
NamedTempFile struct in the popular tempfile crate.
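For illustration, a test could use it roughly like this (just a sketch;
it assumes the default `CreateOptions` are sufficient for the test user):

    use proxmox_sys::fs::CreateOptions;

    use crate::test_support::temp::NamedTempFile;

    #[test]
    fn example() -> Result<(), anyhow::Error> {
        let file = NamedTempFile::new(CreateOptions::new())?;
        std::fs::write(file.path(), b"some test data")?;
        // ... exercise the code under test with file.path() ...
        Ok(())
    } // `file` is dropped here and the temporary file is removed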
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
Cherry-picked from
https://lore.proxmox.com/pdm-devel/20250128122520.167796-3-l.wagner@proxmox.com/T/#u
server/src/lib.rs | 2 +-
server/src/test_support/mod.rs | 4 ++++
server/src/test_support/temp.rs | 33 +++++++++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 1 deletion(-)
create mode 100644 server/src/test_support/temp.rs
diff --git a/server/src/lib.rs b/server/src/lib.rs
index 12dc912f..143ee32d 100644
--- a/server/src/lib.rs
+++ b/server/src/lib.rs
@@ -13,7 +13,7 @@ pub mod task_utils;
pub mod connection;
pub mod pbs_client;
-#[cfg(remote_config = "faked")]
+#[cfg(any(remote_config = "faked", test))]
pub mod test_support;
use anyhow::Error;
diff --git a/server/src/test_support/mod.rs b/server/src/test_support/mod.rs
index e54cd729..f026011c 100644
--- a/server/src/test_support/mod.rs
+++ b/server/src/test_support/mod.rs
@@ -1 +1,5 @@
+#[cfg(remote_config = "faked")]
pub mod fake_remote;
+
+#[cfg(test)]
+pub mod temp;
diff --git a/server/src/test_support/temp.rs b/server/src/test_support/temp.rs
new file mode 100644
index 00000000..a3a6d59b
--- /dev/null
+++ b/server/src/test_support/temp.rs
@@ -0,0 +1,33 @@
+use std::path::{Path, PathBuf};
+
+use anyhow::Error;
+
+use proxmox_sys::fs::CreateOptions;
+
+/// Temporary file that is cleaned up when dropped.
+pub struct NamedTempFile {
+ path: PathBuf,
+}
+
+impl NamedTempFile {
+ /// Create a new temporary file.
+ ///
+ /// The file will be created with the passed [`CreateOptions`].
+ pub fn new(options: CreateOptions) -> Result<Self, Error> {
+ let base = std::env::temp_dir().join("test");
+ let (_, path) = proxmox_sys::fs::make_tmp_file(base, options)?;
+
+ Ok(Self { path })
+ }
+
+ /// Return the [`Path`] to the temporary file.
+ pub fn path(&self) -> &Path {
+ &self.path
+ }
+}
+
+impl Drop for NamedTempFile {
+ fn drop(&mut self) {
+ let _ = std::fs::remove_file(&self.path);
+ }
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 02/28] test support: add NamedTempDir helper
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 01/28] test support: add NamedTempFile helper Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type Lukas Wagner
` (27 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This one is useful when writing tests; it automatically removes the
temporary directory when dropped.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/test_support/temp.rs | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/server/src/test_support/temp.rs b/server/src/test_support/temp.rs
index a3a6d59b..a93c914d 100644
--- a/server/src/test_support/temp.rs
+++ b/server/src/test_support/temp.rs
@@ -31,3 +31,30 @@ impl Drop for NamedTempFile {
let _ = std::fs::remove_file(&self.path);
}
}
+
+/// Temporary directory that is cleaned up when dropped.
+pub struct NamedTempDir {
+ path: PathBuf,
+}
+
+impl NamedTempDir {
+ /// Create a new temporary directory.
+ ///
+ /// The directory will be created with `0o700` permissions.
+ pub fn new() -> Result<Self, Error> {
+ let path = proxmox_sys::fs::make_tmp_dir("/tmp", None)?;
+
+ Ok(Self { path })
+ }
+
+ /// Return the [`Path`] to the temporary directory.
+ pub fn path(&self) -> &Path {
+ &self.path
+ }
+}
+
+impl Drop for NamedTempDir {
+ fn drop(&mut self) {
+ let _ = std::fs::remove_dir_all(&self.path);
+ }
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 01/28] test support: add NamedTempFile helper Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 02/28] test support: add NamedTempDir helper Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-18 15:26 ` Wolfgang Bumiller
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 04/28] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
` (26 subsequent siblings)
29 siblings, 1 reply; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This commit adds the CollectionSettings type, which holds the settings
for the metric collection system: the collection interval, the maximum
number of concurrent connections, and the lower/upper bounds for the
random interval offset and per-connection delay.
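All properties except the id are optional; call sites are expected to go
through the *_or_default helpers, e.g. (sketch):

    use pdm_api_types::{CollectionSettings, DEFAULT_COLLECTION_INTERVAL, DEFAULT_CONCURRENCY};

    let settings = CollectionSettings::new("default");

    // nothing configured yet, so the defaults apply
    assert_eq!(settings.collection_interval_or_default(), DEFAULT_COLLECTION_INTERVAL);
    assert_eq!(settings.max_concurrent_connections_or_default(), DEFAULT_CONCURRENCY);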
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
lib/pdm-api-types/src/lib.rs | 3 +
lib/pdm-api-types/src/metric_collection.rs | 188 +++++++++++++++++++++
2 files changed, 191 insertions(+)
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
index 38449071..6115e41c 100644
--- a/lib/pdm-api-types/src/lib.rs
+++ b/lib/pdm-api-types/src/lib.rs
@@ -19,6 +19,9 @@ pub use acl::*;
mod node_config;
pub use node_config::*;
+mod metric_collection;
+pub use metric_collection::*;
+
mod proxy;
pub use proxy::HTTP_PROXY_SCHEMA;
diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
new file mode 100644
index 00000000..92487d6c
--- /dev/null
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -0,0 +1,188 @@
+//! API types for metric collection settings.
+
+use serde::{Deserialize, Serialize};
+
+use proxmox_schema::{api, Updater};
+
+/// Default metric collection interval.
+pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+
+/// Minimum metric collection interval.
+pub const MIN_COLLECTION_INTERVAL: u64 = 10;
+
+/// Maximum metric collection interval.
+/// PVE and PBS keep 30 minutes of metric history,
+/// maximum is set to 25 minutes to leave some headroom.
+pub const MAX_COLLECTION_INTERVAL: u64 = 1500;
+
+/// Default number of concurrent connections.
+pub const DEFAULT_CONCURRENCY: usize = 10;
+/// Maximum number of concurrent connections.
+pub const MAX_CONCURRENCY: u64 = 50;
+/// Minimum number of concurrent connections (no concurrency).
+pub const MIN_CONCURRENCY: u64 = 1;
+
+/// Default upper bound for the random delay for the
+/// main metric collection loop.
+pub const DEFAULT_MAX_OFFSET: u64 = 10;
+/// Default lower bound for the random delay for the
+/// main metric collection loop.
+pub const DEFAULT_MIN_OFFSET: u64 = 0;
+
+/// Highest configurable upper bound for the random interval offset for the main loop.
+pub const MAX_INTERVAL_OFFSET: u64 = 300;
+/// Lowest configurable lower bound for the random interval offset for the main loop.
+pub const MIN_INTERVAL_OFFSET: u64 = 0;
+
+/// Default upper bound for the random individual connection delay.
+pub const DEFAULT_MAX_CONNECTION_DELAY: u64 = 100;
+/// Default lower bound for the random individual connection delay.
+pub const DEFAULT_MIN_CONNECTION_DELAY: u64 = 0;
+
+/// Highest configurable upper bound for the random individual connection delay.
+pub const MAX_CONNECTION_DELAY: u64 = 1000;
+/// Lowest configurable lower bound for the random individual connection delay.
+pub const MIN_CONNECTION_DELAY: u64 = 0;
+
+#[api(
+ properties: {
+ "collection-interval" : {
+ optional: true,
+ default: DEFAULT_COLLECTION_INTERVAL as isize,
+ minimum: MIN_COLLECTION_INTERVAL as isize,
+ maximum: MAX_COLLECTION_INTERVAL as isize,
+ },
+ "max-concurrent-connections" : {
+ optional: true,
+ default: DEFAULT_CONCURRENCY as isize,
+ minimum: MIN_CONCURRENCY as isize,
+ maximum: MAX_CONCURRENCY as isize,
+ },
+ "max-interval-offset" : {
+ optional: true,
+ default: DEFAULT_MAX_OFFSET as isize,
+ minimum: MIN_INTERVAL_OFFSET as isize,
+ maximum: MAX_INTERVAL_OFFSET as isize,
+ },
+ "min-interval-offset" : {
+ optional: true,
+ default: DEFAULT_MIN_OFFSET as isize,
+ minimum: MIN_INTERVAL_OFFSET as isize,
+ maximum: MAX_INTERVAL_OFFSET as isize,
+ },
+ "max-connection-delay" : {
+ optional: true,
+ default: DEFAULT_MAX_CONNECTION_DELAY as isize,
+ minimum: MIN_CONNECTION_DELAY as isize,
+ maximum: MAX_CONNECTION_DELAY as isize,
+ },
+ "min-connection-delay" : {
+ optional: true,
+ default: DEFAULT_MIN_CONNECTION_DELAY as isize,
+ minimum: MIN_CONNECTION_DELAY as isize,
+ maximum: MAX_CONNECTION_DELAY as isize,
+ },
+ },
+)]
+#[derive(Clone, Default, Deserialize, Serialize, Updater)]
+#[serde(rename_all = "kebab-case")]
+/// Settings for the metric collection system.
+pub struct CollectionSettings {
+ /// Collection settings ID
+ #[updater(skip)]
+ pub id: String,
+
+ /// Interval in seconds at which to collect metrics.
+ /// The points in time at which metrics are collected
+ /// are aligned based on the collection interval. For instance,
+ /// a collection interval of 300 (5 minutes) would schedule metric collection
+ /// at 11:00:00, 11:05:00 (without accounting for the random offset).
+ ///
+ /// To avoid load spikes, metric collection runs are offset by
+ /// a random number of seconds between min_interval_offset and max_interval_offset.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub collection_interval: Option<u64>,
+
+ /// Maximum number of concurrent connections while collecting metrics.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub max_concurrent_connections: Option<usize>,
+
+ /// Maximum offset in seconds for the metric collection interval.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub max_interval_offset: Option<u64>,
+
+ /// Minimum offset in seconds for the metric collection interval.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub min_interval_offset: Option<u64>,
+
+ /// Maximum random delay in milliseconds for each outgoing connection.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub max_connection_delay: Option<u64>,
+
+ /// Minimum random delay in milliseconds for each outgoing connection.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub min_connection_delay: Option<u64>,
+}
+
+#[api]
+#[derive(Copy, Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Deletable property for [`CollectionSettings`].
+pub enum DeletableCollectionSettingsProperty {
+ /// Delete 'collection-interval'.
+ CollectionInterval,
+ /// Delete 'max-concurrent-connections'.
+ MaxConcurrentConnections,
+ /// Delete 'max-interval-offset'.
+ MaxIntervalOffset,
+ /// Delete 'min-interval-offset'.
+ MinIntervalOffset,
+ /// Delete 'max-connection-delay'.
+ MaxConnectionDelay,
+ /// Delete 'min-connection-delay'.
+ MinConnectionDelay,
+}
+
+impl CollectionSettings {
+ /// Create a new settings instance with a given `id`.
+ pub fn new(id: &str) -> Self {
+ Self {
+ id: id.into(),
+ ..Default::default()
+ }
+ }
+
+ /// Return the collection interval or the default if not configured.
+ pub fn collection_interval_or_default(&self) -> u64 {
+ self.collection_interval
+ .unwrap_or(DEFAULT_COLLECTION_INTERVAL)
+ }
+
+ /// Return the number of allowed concurrent connections or the default if not configured.
+ pub fn max_concurrent_connections_or_default(&self) -> usize {
+ self.max_concurrent_connections
+ .unwrap_or(DEFAULT_CONCURRENCY)
+ }
+
+ /// Return the upper bound for the main loop delay or the default if not configured.
+ pub fn max_interval_offset_or_default(&self) -> u64 {
+ self.max_interval_offset.unwrap_or(DEFAULT_MAX_OFFSET)
+ }
+
+ /// Return the lower bound for the main loop delay or the default if not configured.
+ pub fn min_interval_offset_or_default(&self) -> u64 {
+ self.min_interval_offset.unwrap_or(DEFAULT_MIN_OFFSET)
+ }
+
+ /// Return the upper bound for the connection delay or the default if not configured.
+ pub fn max_connection_delay_or_default(&self) -> u64 {
+ self.max_connection_delay
+ .unwrap_or(DEFAULT_MAX_CONNECTION_DELAY)
+ }
+
+ /// Return the lower bound for the connection delay or the default if not configured.
+ pub fn min_connection_delay_or_default(&self) -> u64 {
+ self.min_connection_delay
+ .unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
+ }
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 04/28] pdm-config: add functions for reading/writing metric collection settings
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (2 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 05/28] metric collection: split top_entities into separate module Lukas Wagner
` (25 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This commit adds support for reading and writing a new configuration
file at /etc/proxmox-datacenter-manager/metric-collection.cfg.
It is a regular SectionConfig file holding CollectionSettings entries
for now. For starters, there will be only a single entry with the key
'default', e.g.
collection-settings: default
max-concurrency 10
...
It might seem a bit odd to use a section config file for something like
this, but this gives us the ability to expand it more easily in the
future, e.g. different settings for different groups of remotes, other
section types for logical structuring, etc.
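Reading the 'default' entry then amounts to roughly the following
(sketch; this mirrors how the collection task consumes the config later
in the series):

    use pdm_api_types::CollectionSettings;
    use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;

    let (config, _digest) = pdm_config::metric_collection::config()?;
    let all: Vec<CollectionSettings> =
        config.convert_to_typed_array(COLLECTION_SETTINGS_TYPE)?;
    let settings = all
        .into_iter()
        .find(|settings| settings.id == "default")
        .unwrap_or_else(|| CollectionSettings::new("default"));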
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
lib/pdm-config/src/lib.rs | 1 +
lib/pdm-config/src/metric_collection.rs | 69 +++++++++++++++++++++++++
2 files changed, 70 insertions(+)
create mode 100644 lib/pdm-config/src/metric_collection.rs
diff --git a/lib/pdm-config/src/lib.rs b/lib/pdm-config/src/lib.rs
index ac398cab..d80ee402 100644
--- a/lib/pdm-config/src/lib.rs
+++ b/lib/pdm-config/src/lib.rs
@@ -5,6 +5,7 @@ pub use pdm_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
pub mod certificate_config;
pub mod domains;
+pub mod metric_collection;
pub mod node;
pub mod remotes;
pub mod setup;
diff --git a/lib/pdm-config/src/metric_collection.rs b/lib/pdm-config/src/metric_collection.rs
new file mode 100644
index 00000000..1c402700
--- /dev/null
+++ b/lib/pdm-config/src/metric_collection.rs
@@ -0,0 +1,69 @@
+//! Read/write metric collection configuration
+
+use std::sync::OnceLock;
+
+use anyhow::Error;
+
+use proxmox_config_digest::ConfigDigest;
+use proxmox_product_config::{open_api_lockfile, replace_config, ApiLockGuard};
+
+use pdm_api_types::{CollectionSettings, PROXMOX_SAFE_ID_FORMAT};
+
+use pdm_buildcfg::configdir;
+use proxmox_schema::{ApiType, ObjectSchema, Schema, StringSchema};
+use proxmox_section_config::{SectionConfig, SectionConfigData, SectionConfigPlugin};
+
+const METRIC_COLLECTION_CFG_FILENAME: &str = configdir!("/metric-collection.cfg");
+const METRIC_COLLECTION_CFG_LOCKFILE: &str = configdir!("/.metric-collection.lock");
+
+/// The section-config type name for metric collection settings.
+pub const COLLECTION_SETTINGS_TYPE: &str = "collection-settings";
+
+/// Section config schema for the public config file.
+fn config_parser() -> &'static SectionConfig {
+ static CONFIG: OnceLock<SectionConfig> = OnceLock::new();
+ CONFIG.get_or_init(config_init)
+}
+
+const ENTITY_NAME_SCHEMA: Schema = StringSchema::new("collection-settings key.")
+ .format(&PROXMOX_SAFE_ID_FORMAT)
+ .min_length(2)
+ .max_length(32)
+ .schema();
+
+fn config_init() -> SectionConfig {
+ let mut config = SectionConfig::new(&ENTITY_NAME_SCHEMA);
+
+ const MATCHER_SCHEMA: &ObjectSchema = CollectionSettings::API_SCHEMA.unwrap_object_schema();
+ config.register_plugin(SectionConfigPlugin::new(
+ COLLECTION_SETTINGS_TYPE.into(),
+ Some(String::from("id")),
+ MATCHER_SCHEMA,
+ ));
+
+ config
+}
+
+/// Lock the metric-collection config
+pub fn lock_config() -> Result<ApiLockGuard, Error> {
+ open_api_lockfile(METRIC_COLLECTION_CFG_LOCKFILE, None, true)
+}
+
+/// Return contents of the metric-collection config
+pub fn config() -> Result<(SectionConfigData, ConfigDigest), Error> {
+ let content = proxmox_sys::fs::file_read_optional_string(METRIC_COLLECTION_CFG_FILENAME)?
+ .unwrap_or_default();
+
+ let digest = openssl::sha::sha256(content.as_bytes());
+ let data = config_parser().parse(METRIC_COLLECTION_CFG_FILENAME, &content)?;
+
+ Ok((data, digest.into()))
+}
+
+/// Replace the currently persisted metric-collection config
+pub fn save_config(config: &SectionConfigData) -> Result<(), Error> {
+ let content = config_parser().write(METRIC_COLLECTION_CFG_FILENAME, config)?;
+ replace_config(METRIC_COLLECTION_CFG_FILENAME, content.as_bytes())?;
+
+ Ok(())
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 05/28] metric collection: split top_entities into separate module
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (3 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 04/28] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 06/28] metric collection: save metric data to RRD in separate task Lukas Wagner
` (24 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This makes the parent module a bit easier to navigate.
No functional changes intended.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/api/resources.rs | 3 +-
server/src/metric_collection/mod.rs | 149 +------------------
server/src/metric_collection/top_entities.rs | 148 ++++++++++++++++++
3 files changed, 152 insertions(+), 148 deletions(-)
create mode 100644 server/src/metric_collection/top_entities.rs
diff --git a/server/src/api/resources.rs b/server/src/api/resources.rs
index 453d9e8e..afda51e3 100644
--- a/server/src/api/resources.rs
+++ b/server/src/api/resources.rs
@@ -26,6 +26,7 @@ use proxmox_subscription::SubscriptionStatus;
use pve_api_types::{ClusterResource, ClusterResourceType};
use crate::connection;
+use crate::metric_collection::top_entities;
pub const ROUTER: Router = Router::new()
.get(&list_subdirs_api_method!(SUBDIRS))
@@ -320,7 +321,7 @@ async fn get_top_entities(timeframe: Option<RrdTimeframe>) -> Result<TopEntities
let (remotes_config, _) = pdm_config::remotes::config()?;
let timeframe = timeframe.unwrap_or(RrdTimeframe::Day);
- let res = crate::metric_collection::calculate_top(&remotes_config.sections, timeframe, 10);
+ let res = top_entities::calculate_top(&remotes_config.sections, timeframe, 10);
Ok(res)
}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index b37d6782..5540f937 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -6,15 +6,13 @@ use anyhow::Error;
use pbs_api_types::{MetricDataPoint, MetricDataType};
use proxmox_rrd::rrd::DataSourceType;
-use pdm_api_types::{
- remotes::RemoteType,
- resource::{Resource, ResourceRrdData, TopEntities, TopEntity},
-};
+use pdm_api_types::remotes::RemoteType;
use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+pub mod top_entities;
const COLLECTION_INTERVAL: u64 = 60;
@@ -150,146 +148,3 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
data_source_type,
);
}
-
-fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
- let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
- Ok(idx) | Err(idx) => idx,
- };
-
- vec.insert(index, value);
- if vec.len() > limit {
- for _ in 0..(vec.len() - limit) {
- vec.remove(0);
- }
- }
-}
-
-// for now simple sum of the values => area under the graph curve
-fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
- let mut coefficient = 0.0;
- for point in values.data.iter() {
- let value = point.unwrap_or_default();
- if value.is_finite() {
- coefficient += value;
- }
- }
-
- coefficient
-}
-
-// FIXME: cache the values instead of calculate freshly every time?
-// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
-pub fn calculate_top(
- remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- num: usize,
-) -> TopEntities {
- let mut guest_cpu = Vec::new();
- let mut node_cpu = Vec::new();
- let mut node_memory = Vec::new();
-
- for remote_name in remotes.keys() {
- if let Some(data) =
- crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
- {
- for res in data.resources {
- let id = res.id().to_string();
- let name = format!("pve/{remote_name}/{id}");
- match &res {
- Resource::PveStorage(_) => {}
- Resource::PveQemu(_) | Resource::PveLxc(_) => {
- if let Some(entity) =
- get_entity(timeframe, remote_name, res, name, "cpu_current")
- {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
- }
- }
- Resource::PveNode(_) => {
- if let Some(entity) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "cpu_current",
- ) {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
- }
- // convert mem/mem_total into a single entity
- if let Some(mut mem) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "mem_used",
- ) {
- if let Some(mem_total) =
- get_entity(timeframe, remote_name, res, name, "mem_total")
- {
- // skip if we don't have the same amount of data for used and total
- let mem_rrd = &mem.1.rrd_data.data;
- let mem_total_rrd = &mem_total.1.rrd_data.data;
- if mem_rrd.len() != mem_total_rrd.len() {
- continue;
- }
- let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
- let mut mem_usage = Vec::new();
- for i in 0..mem_rrd.len() {
- let point = match (mem_rrd[i], mem_total_rrd[i]) {
- (Some(mem), Some(total)) => Some(mem / total),
- _ => None,
- };
- mem_usage.push(point)
- }
- mem.1.rrd_data.data = mem_usage;
- insert_sorted(&mut node_memory, (coefficient, mem.1), num);
- }
- }
- }
- Resource::PbsNode(_) => {}
- Resource::PbsDatastore(_) => {}
- }
- }
- }
- }
-
- TopEntities {
- guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
- }
-}
-
-fn get_entity(
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- remote_name: &String,
- res: Resource,
- name: String,
- metric: &str,
-) -> Option<(f64, TopEntity)> {
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &name,
- metric,
- timeframe,
- proxmox_rrd_api_types::RrdMode::Average,
- ) {
- let coefficient = calculate_coefficient(&values);
- if coefficient > 0.0 {
- return Some((
- coefficient,
- TopEntity {
- remote: remote_name.to_string(),
- resource: res,
- rrd_data: ResourceRrdData {
- start: values.start,
- resolution: values.resolution,
- data: values.data,
- },
- },
- ));
- }
- }
-
- None
-}
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
new file mode 100644
index 00000000..f8e053fb
--- /dev/null
+++ b/server/src/metric_collection/top_entities.rs
@@ -0,0 +1,148 @@
+use std::collections::HashMap;
+
+use pdm_api_types::resource::{Resource, ResourceRrdData, TopEntities, TopEntity};
+
+use super::rrd_cache;
+
+fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
+ let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
+ Ok(idx) | Err(idx) => idx,
+ };
+
+ vec.insert(index, value);
+ if vec.len() > limit {
+ for _ in 0..(vec.len() - limit) {
+ vec.remove(0);
+ }
+ }
+}
+
+// for now simple sum of the values => area under the graph curve
+fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
+ let mut coefficient = 0.0;
+ for point in values.data.iter() {
+ let value = point.unwrap_or_default();
+ if value.is_finite() {
+ coefficient += value;
+ }
+ }
+
+ coefficient
+}
+
+// FIXME: cache the values instead of calculate freshly every time?
+// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
+pub fn calculate_top(
+ remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ num: usize,
+) -> TopEntities {
+ let mut guest_cpu = Vec::new();
+ let mut node_cpu = Vec::new();
+ let mut node_memory = Vec::new();
+
+ for remote_name in remotes.keys() {
+ if let Some(data) =
+ crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
+ {
+ for res in data.resources {
+ let id = res.id().to_string();
+ let name = format!("pve/{remote_name}/{id}");
+ match &res {
+ Resource::PveStorage(_) => {}
+ Resource::PveQemu(_) | Resource::PveLxc(_) => {
+ if let Some(entity) =
+ get_entity(timeframe, remote_name, res, name, "cpu_current")
+ {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
+ }
+ }
+ Resource::PveNode(_) => {
+ if let Some(entity) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "cpu_current",
+ ) {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
+ }
+ // convert mem/mem_total into a single entity
+ if let Some(mut mem) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "mem_used",
+ ) {
+ if let Some(mem_total) =
+ get_entity(timeframe, remote_name, res, name, "mem_total")
+ {
+ // skip if we don't have the same amount of data for used and total
+ let mem_rrd = &mem.1.rrd_data.data;
+ let mem_total_rrd = &mem_total.1.rrd_data.data;
+ if mem_rrd.len() != mem_total_rrd.len() {
+ continue;
+ }
+ let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
+ let mut mem_usage = Vec::new();
+ for i in 0..mem_rrd.len() {
+ let point = match (mem_rrd[i], mem_total_rrd[i]) {
+ (Some(mem), Some(total)) => Some(mem / total),
+ _ => None,
+ };
+ mem_usage.push(point)
+ }
+ mem.1.rrd_data.data = mem_usage;
+ insert_sorted(&mut node_memory, (coefficient, mem.1), num);
+ }
+ }
+ }
+ Resource::PbsNode(_) => {}
+ Resource::PbsDatastore(_) => {}
+ }
+ }
+ }
+ }
+
+ TopEntities {
+ guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
+ }
+}
+
+fn get_entity(
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ remote_name: &String,
+ res: Resource,
+ name: String,
+ metric: &str,
+) -> Option<(f64, TopEntity)> {
+ if let Ok(Some(values)) = rrd_cache::extract_data(
+ &name,
+ metric,
+ timeframe,
+ proxmox_rrd_api_types::RrdMode::Average,
+ ) {
+ let coefficient = calculate_coefficient(&values);
+ if coefficient > 0.0 {
+ return Some((
+ coefficient,
+ TopEntity {
+ remote: remote_name.to_string(),
+ resource: res,
+ rrd_data: ResourceRrdData {
+ start: values.start,
+ resolution: values.resolution,
+ data: values.data,
+ },
+ },
+ ));
+ }
+ }
+
+ None
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 06/28] metric collection: save metric data to RRD in separate task
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (4 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 05/28] metric collection: split top_entities into separate module Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 07/28] metric collection: rework metric poll task Lukas Wagner
` (23 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This is a preparation for concurrent metric collection. While the RRD
cache appears to be safe for concurrent access (it uses an RwLock to
protect the data), delegating all RRD writes to a separate tokio task
makes it IMO easier to reason about and understand. Furthermore, it
decouples the 'metric fetching' and 'metric storing' parts, making it
much easier to write tests for both parts independently.
For better code separation, this also splits out the new task into
a new submodule.
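From the fetching side, storing metrics then becomes a simple channel
send (sketch):

    let (tx, rx) = tokio::sync::mpsc::channel(128);
    tokio::spawn(rrd_task::store_in_rrd_task(rx));

    // ... later, after fetching `metrics` for `remote`:
    tx.send(RrdStoreRequest::Pve { remote, metrics }).await?;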
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/metric_collection/mod.rs | 102 +++++++----------------
server/src/metric_collection/rrd_task.rs | 96 +++++++++++++++++++++
2 files changed, 128 insertions(+), 70 deletions(-)
create mode 100644 server/src/metric_collection/rrd_task.rs
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 5540f937..06ade5f0 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -2,18 +2,18 @@ use std::collections::HashMap;
use std::pin::pin;
use anyhow::Error;
-
-use pbs_api_types::{MetricDataPoint, MetricDataType};
-use proxmox_rrd::rrd::DataSourceType;
+use tokio::sync::mpsc::{self, Sender};
use pdm_api_types::remotes::RemoteType;
-use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+mod rrd_task;
pub mod top_entities;
+use rrd_task::RrdStoreRequest;
+
const COLLECTION_INTERVAL: u64 = 60;
/// Initialize the RRD cache
@@ -25,14 +25,22 @@ pub fn init() -> Result<(), Error> {
/// Start the metric collection task.
pub fn start_task() {
+ let (tx, rx) = mpsc::channel(128);
+
+ tokio::spawn(async move {
+ let task_scheduler = pin!(metric_collection_task(tx));
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(task_scheduler, abort_future).await;
+ });
+
tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task());
+ let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(task_scheduler, abort_future).await;
});
}
-async fn metric_collection_task() -> Result<(), Error> {
+async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
loop {
@@ -59,37 +67,33 @@ async fn metric_collection_task() -> Result<(), Error> {
.cluster_metrics_export(Some(true), Some(false), Some(start_time))
.await?;
- //// Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
RemoteType::Pbs => {
let client = connection::make_pbs_client(remote)?;
let metrics = client.metrics(Some(true), Some(start_time)).await?;
- // Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
}?;
@@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
}
}
}
-
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
- let name = format!(
- "pve/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
- ClusterMetricsDataType::Counter => DataSourceType::Counter,
- ClusterMetricsDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
-
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
- let name = format!(
- "pbs/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- MetricDataType::Gauge => DataSourceType::Gauge,
- MetricDataType::Counter => DataSourceType::Counter,
- MetricDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
new file mode 100644
index 00000000..a72945df
--- /dev/null
+++ b/server/src/metric_collection/rrd_task.rs
@@ -0,0 +1,96 @@
+use anyhow::Error;
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use proxmox_rrd::rrd::DataSourceType;
+use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
+use tokio::sync::mpsc::Receiver;
+
+use super::rrd_cache;
+
+/// Store request for the RRD task.
+pub(super) enum RrdStoreRequest {
+ /// Store PVE metrics.
+ Pve {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: ClusterMetrics,
+ },
+ /// Store PBS metrics.
+ Pbs {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: Metrics,
+ },
+}
+
+/// Task which stores received metrics in the RRD. Metric data is fed into
+/// this task via a MPSC channel.
+pub(super) async fn store_in_rrd_task(
+ mut receiver: Receiver<RrdStoreRequest>,
+) -> Result<(), Error> {
+ while let Some(msg) = receiver.recv().await {
+ // Involves some blocking file IO
+ let res = tokio::task::spawn_blocking(move || match msg {
+ RrdStoreRequest::Pve { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pve(&remote, &data_point);
+ }
+ }
+ RrdStoreRequest::Pbs { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pbs(&remote, &data_point);
+ }
+ }
+ })
+ .await;
+
+ if let Err(err) = res {
+ log::error!("error in rrd task when attempting to save metrics: {err}");
+ }
+ }
+
+ Ok(())
+}
+
+fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+ let name = format!(
+ "pve/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
+ ClusterMetricsDataType::Counter => DataSourceType::Counter,
+ ClusterMetricsDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
+
+fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+ let name = format!(
+ "pbs/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ MetricDataType::Gauge => DataSourceType::Gauge,
+ MetricDataType::Counter => DataSourceType::Counter,
+ MetricDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 07/28] metric collection: rework metric poll task
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (5 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 06/28] metric collection: save metric data to RRD in separate task Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 08/28] metric collection: persist state after metric collection Lukas Wagner
` (22 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
Metric collection is mostly limited by network latency, not disk IO or
CPU. This means we should be able to get significant speedups by polling
remotes concurrently from spawned tokio workers.
To avoid load/IO spikes, we limit the number of concurrent connections
using a semaphore.
Each concurrent task which fetches a single remote waits a random amount
of time before actually connecting to the remote. The upper and lower
bounds for this random delay can be configured in the metric collection
settings. The main aim of this mechanism is to reduce load spikes.
Furthermore, each time when the main collection interval timer
fires, a random delay is introduced before actually starting the
collection process.
This is useful due to how we set up the timer; we configure
it to figure at aligned points in time. For instance, if the
collection interval is set to 60s, the timer fires at
minute boundaries. Using this random offset, we can avoid
triggering at the same time as other timers (cron, systemd).
Furthermore, we add a mechanism to trigger metric collection manually.
This is achieved by using `tokio::select!` on both the timer's `tick`
method and the `recv` method of an `mpsc` which is used to send control
messages to the metric collection task.
For better code structure, the collection task is split into a separate
module.
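The heart of the task is the select over both event sources; condensed,
the loop looks like this (see the full code below):

    loop {
        tokio::select! {
            _ = timer.tick() => {
                // random interval offset, then fetch all remotes concurrently
            }
            msg = control_message_rx.recv() => match msg {
                Some(ControlMsg::CollectSingleRemote(remote)) => { /* fetch just this one */ }
                Some(ControlMsg::CollectAllRemotes) => { /* fetch all remotes */ }
                None => {} // channel closed
            },
        }
    }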
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
Changes since v1:
- get_settings_or_default function: use unwrap_or_default instead
of match
- Add `rand` crate dependency to d/control
- Refactor sleep helpers, consolidate into a single
parameterised function
- Get rid of .unwrap when spawning the metric collection task
- Let trigger_metric_collection_for_remote take a String instead of
&str
Cargo.toml | 1 +
debian/control | 1 +
server/Cargo.toml | 1 +
server/src/bin/proxmox-datacenter-api.rs | 2 +-
.../src/metric_collection/collection_task.rs | 300 ++++++++++++++++++
server/src/metric_collection/mod.rs | 123 +++----
6 files changed, 347 insertions(+), 81 deletions(-)
create mode 100644 server/src/metric_collection/collection_task.rs
diff --git a/Cargo.toml b/Cargo.toml
index 4f3b1d03..7dd60a53 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -108,6 +108,7 @@ once_cell = "1.3.1"
openssl = "0.10.40"
percent-encoding = "2.1"
pin-project-lite = "0.2"
+rand = "0.8"
regex = "1.5.5"
serde = { version = "1.0", features = ["derive"] }
serde_cbor = "0.11.1"
diff --git a/debian/control b/debian/control
index 399ad86d..1f7ad17c 100644
--- a/debian/control
+++ b/debian/control
@@ -88,6 +88,7 @@ Build-Depends: cargo:native,
librust-proxmox-time-api-0.1+default-dev,
librust-proxmox-time-api-0.1+impl-dev,
librust-proxmox-uuid-1+default-dev,
+ librust-rand-0.8+default-dev,
librust-regex-1+default-dev (>= 1.5.5-~~),
librust-serde-1+default-dev,
librust-serde-1+derive-dev,
diff --git a/server/Cargo.toml b/server/Cargo.toml
index 7b0058e1..c8308f2c 100644
--- a/server/Cargo.toml
+++ b/server/Cargo.toml
@@ -24,6 +24,7 @@ nix.workspace = true
once_cell.workspace = true
openssl.workspace = true
percent-encoding.workspace = true
+rand.workspace = true
serde.workspace = true
serde_json.workspace = true
syslog.workspace = true
diff --git a/server/src/bin/proxmox-datacenter-api.rs b/server/src/bin/proxmox-datacenter-api.rs
index a79094d5..6e85e523 100644
--- a/server/src/bin/proxmox-datacenter-api.rs
+++ b/server/src/bin/proxmox-datacenter-api.rs
@@ -286,7 +286,7 @@ async fn run(debug: bool) -> Result<(), Error> {
});
start_task_scheduler();
- metric_collection::start_task();
+ metric_collection::start_task()?;
resource_cache::start_task();
server.await?;
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
new file mode 100644
index 00000000..b55c8e92
--- /dev/null
+++ b/server/src/metric_collection/collection_task.rs
@@ -0,0 +1,300 @@
+use std::{collections::HashMap, sync::Arc, time::Duration};
+
+use anyhow::Error;
+use rand::Rng;
+use tokio::{
+ sync::{
+ mpsc::{Receiver, Sender},
+ OwnedSemaphorePermit, Semaphore,
+ },
+ time::Interval,
+};
+
+use proxmox_section_config::typed::SectionConfigData;
+
+use pdm_api_types::{
+ remotes::{Remote, RemoteType},
+ CollectionSettings,
+};
+use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
+
+use crate::{connection, task_utils};
+
+use super::rrd_task::RrdStoreRequest;
+
+/// Control messages for the metric collection task.
+pub(super) enum ControlMsg {
+ CollectSingleRemote(String),
+ CollectAllRemotes,
+}
+
+/// Task which periodically collects metrics from all remotes and stores
+/// them in the local metrics database.
+pub(super) struct MetricCollectionTask {
+ most_recent_timestamps: HashMap<String, i64>,
+ settings: CollectionSettings,
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+}
+
+impl MetricCollectionTask {
+ /// Create a new metric collection task.
+ pub(super) fn new(
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+ ) -> Result<Self, Error> {
+ let settings = Self::get_settings_or_default();
+
+ Ok(Self {
+ most_recent_timestamps: HashMap::new(),
+ settings,
+ metric_data_tx,
+ control_message_rx,
+ })
+ }
+
+ /// Run the metric collection task.
+ ///
+ /// This function never returns.
+ #[tracing::instrument(skip_all, name = "metric_collection_task")]
+ pub(super) async fn run(&mut self) {
+ let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
+
+ log::debug!(
+ "metric collection starting up. Collection interval set to {} seconds.",
+ self.settings.collection_interval_or_default()
+ );
+
+ loop {
+ let old_settings = self.settings.clone();
+ tokio::select! {
+ _ = timer.tick() => {
+ // Reload settings in case they have changed in the meanwhile
+ self.settings = Self::get_settings_or_default();
+
+ log::debug!("starting metric collection from all remotes - triggered by timer");
+ Self::sleep_for_random_millis(
+ self.settings.min_interval_offset_or_default() * 1000,
+ self.settings.max_interval_offset_or_default() * 1000,
+ "interval-offset",
+ ).await;
+
+ if let Some(remotes) = Self::load_remote_config() {
+ let to_fetch = remotes.order.as_slice();
+ self.fetch_remotes(&remotes, to_fetch).await;
+ }
+ }
+
+ val = self.control_message_rx.recv() => {
+ // Reload settings in case they have changed in the meanwhile
+ self.settings = Self::get_settings_or_default();
+ match val {
+ Some(ControlMsg::CollectSingleRemote(remote)) => {
+ if let Some(remotes) = Self::load_remote_config() {
+ log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
+ self.fetch_remotes(&remotes, &[remote]).await;
+ }
+ }
+ Some(ControlMsg::CollectAllRemotes) => {
+ if let Some(remotes) = Self::load_remote_config() {
+ log::debug!("starting metric collection from all remotes - triggered by control message");
+ self.fetch_remotes(&remotes, &remotes.order).await;
+ }
+ }
+ _ => {},
+ }
+ }
+ }
+
+ let interval = self.settings.collection_interval_or_default();
+
+ if old_settings.collection_interval_or_default() != interval {
+ log::info!(
+ "metric collection interval changed to {} seconds, reloading timer",
+ interval
+ );
+ timer = Self::setup_timer(interval);
+ }
+ }
+ }
+
+ /// Sleep between `min` and `max` milliseconds.
+ ///
+ /// If `min` is larger than `max`, `min` will be set to `max` and a log message
+ /// will be printed.
+ async fn sleep_for_random_millis(mut min: u64, max: u64, param_base: &str) {
+ if min > max {
+ log::warn!(
+ "min-{param_base} is larger than max-{param_base} ({min} > {max}) - \
+ capping it to max-{param_base} ({max})"
+ );
+ min = max;
+ }
+
+ let jitter = {
+ let mut rng = rand::thread_rng();
+ rng.gen_range(min..=max)
+ };
+
+ tokio::time::sleep(Duration::from_millis(jitter)).await;
+ }
+
+ fn get_settings_or_default() -> CollectionSettings {
+ // We want to fall back to defaults if
+ // - the config file does not exist (no error should be logged)
+ // - the section type is wrong or if the config failed to parse (log an error in this
+ // case)
+
+ fn get_settings_impl() -> Result<CollectionSettings, Error> {
+ let (config, _) = pdm_config::metric_collection::config()?;
+
+ let all_sections: Vec<CollectionSettings> =
+ config.convert_to_typed_array(COLLECTION_SETTINGS_TYPE)?;
+
+ for section in all_sections {
+ if section.id == "default" {
+ return Ok(section);
+ }
+ }
+
+ Ok(CollectionSettings::new("default"))
+ }
+
+ get_settings_impl().unwrap_or_else(|err| {
+ log::error!(
+ "could not read metric collection settings: {err} - falling back to default config"
+ );
+ CollectionSettings::new("default")
+ })
+ }
+
+ /// Set up a [`tokio::time::Interval`] instance with the provided interval.
+ /// The timer will be aligned, e.g. an interval of `60` will let the timer
+ /// fire at minute boundaries.
+ fn setup_timer(interval: u64) -> Interval {
+ let mut timer = tokio::time::interval(Duration::from_secs(interval));
+ let first_run = task_utils::next_aligned_instant(interval).into();
+ timer.reset_at(first_run);
+
+ timer
+ }
+
+ /// Convenience helper to load `remote.cfg`, logging the error
+ /// and returning `None` if the config could not be read.
+ fn load_remote_config() -> Option<SectionConfigData<Remote>> {
+ match pdm_config::remotes::config() {
+ Ok((remotes, _)) => Some(remotes),
+ Err(e) => {
+ log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
+ None
+ }
+ }
+ }
+
+ /// Fetch metric data from a provided list of remotes concurrently.
+ /// The maximum number of concurrent connections is determined by
+ /// `max_concurrent_connections` in the [`CollectionSettings`]
+ /// instance in `self`.
+ async fn fetch_remotes(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ remotes_to_fetch: &[String],
+ ) {
+ let semaphore = Arc::new(Semaphore::new(
+ self.settings.max_concurrent_connections_or_default(),
+ ));
+ let mut handles = Vec::new();
+
+ for remote_name in remotes_to_fetch {
+ let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+
+ // unwrap is okay here, acquire_* will only fail if `close` has been
+ // called on the semaphore.
+ let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
+
+ if let Some(remote) = remote_config.get(remote_name).cloned() {
+ log::debug!("fetching remote '{}'", remote.id);
+ let handle = tokio::spawn(Self::fetch_single_remote(
+ self.settings.clone(),
+ remote,
+ start_time,
+ self.metric_data_tx.clone(),
+ permit,
+ ));
+
+ handles.push((remote_name.clone(), handle));
+ }
+ }
+
+ for (remote_name, handle) in handles {
+ let res = handle.await;
+
+ match res {
+ Ok(Ok(ts)) => {
+ self.most_recent_timestamps
+ .insert(remote_name.to_string(), ts);
+ }
+ Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
+ Err(err) => {
+ log::error!(
+ "join error for metric collection task for remote {remote_name}: {err}"
+ )
+ }
+ }
+ }
+ }
+
+ /// Fetch a single remote.
+ #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
+ async fn fetch_single_remote(
+ settings: CollectionSettings,
+ remote: Remote,
+ start_time: i64,
+ sender: Sender<RrdStoreRequest>,
+ _permit: OwnedSemaphorePermit,
+ ) -> Result<i64, Error> {
+ Self::sleep_for_random_millis(
+ settings.min_connection_delay_or_default(),
+ settings.max_connection_delay_or_default(),
+ "connection-delay",
+ )
+ .await;
+
+ let most_recent_timestamp = match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(Some(true), Some(false), Some(start_time))
+ .await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client.metrics(Some(true), Some(start_time)).await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ };
+
+ Ok(most_recent_timestamp)
+ }
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 06ade5f0..9b203615 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,20 +1,17 @@
-use std::collections::HashMap;
use std::pin::pin;
+use std::sync::OnceLock;
-use anyhow::Error;
+use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
-use pdm_api_types::remotes::RemoteType;
-
-use crate::{connection, task_utils};
-
+mod collection_task;
pub mod rrd_cache;
mod rrd_task;
pub mod top_entities;
-use rrd_task::RrdStoreRequest;
+use collection_task::{ControlMsg, MetricCollectionTask};
-const COLLECTION_INTERVAL: u64 = 60;
+static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
@@ -24,89 +21,55 @@ pub fn init() -> Result<(), Error> {
}
/// Start the metric collection task.
-pub fn start_task() {
- let (tx, rx) = mpsc::channel(128);
+pub fn start_task() -> Result<(), Error> {
+ let (metric_data_tx, metric_data_rx) = mpsc::channel(128);
+
+ let (trigger_collection_tx, trigger_collection_rx) = mpsc::channel(128);
+ if CONTROL_MESSAGE_TX.set(trigger_collection_tx).is_err() {
+ bail!("control message sender already set");
+ }
tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task(tx));
+ let metric_collection_task_future = pin!(async move {
+ match MetricCollectionTask::new(metric_data_tx, trigger_collection_rx) {
+ Ok(mut task) => task.run().await,
+ Err(err) => log::error!("could not start metric collection task: {err}"),
+ }
+ });
+
let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
+ futures::future::select(metric_collection_task_future, abort_future).await;
});
tokio::spawn(async move {
- let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
+ futures::future::select(rrd_task_future, abort_future).await;
});
-}
-async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
- let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
+ Ok(())
+}
- loop {
- let delay_target = task_utils::next_aligned_instant(COLLECTION_INTERVAL);
- tokio::time::sleep_until(tokio::time::Instant::from_std(delay_target)).await;
+/// Schedule metric collection for a given remote as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection_for_remote(remote: String) -> Result<(), Error> {
+ if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+ sender.send(ControlMsg::CollectSingleRemote(remote)).await?;
+ }
- let remotes = match pdm_config::remotes::config() {
- Ok((remotes, _)) => remotes,
- Err(e) => {
- log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
- continue;
- }
- };
-
- for (remote_name, remote) in &remotes.sections {
- let start_time = *most_recent_timestamps.get(remote_name).unwrap_or(&0);
- let remote_name_clone = remote_name.clone();
-
- let res = async {
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- }?;
-
- Ok::<i64, Error>(most_recent_timestamp)
- }
- .await;
+ Ok(())
+}
- match res {
- Ok(ts) => {
- most_recent_timestamps.insert(remote_name.to_string(), ts);
- }
- Err(err) => log::error!("failed to collect metrics for {remote_name}: {err}"),
- }
- }
+/// Schedule metric collection for all remotes as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection() -> Result<(), Error> {
+ if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+ sender.send(ControlMsg::CollectAllRemotes).await?;
}
+
+ Ok(())
}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 08/28] metric collection: persist state after metric collection
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (6 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 07/28] metric collection: rework metric poll task Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 09/28] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
` (21 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
The metric collection has to maintain some state; at the moment this
is only the most recent timestamp from all received metric data points.
We use this as a cut-off time when requesting metrics from a remote, as
everything *older* than that should already be stored in the database.
Up until now, the state was only stored in memory, which means that it
was lost when the daemon restarted.
The changes in this commit make the metric collection system
load/save its state from a file in
/var/lib/proxmox-datacenter-manager/metric-collection-state.json
The following data points are stored for every remote:
- most-recent-datapoint:
Timestamp of the most recent datapoint, used as a cut-off when
fetching new metrics
- last-collection:
Timestamp of the last *successful* metric collection for a remote,
based on PDM's local time
- error:
String representation of any error that occurred in the last collection
attempt
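For illustration, the resulting state file could look roughly like this
for one healthy and one failing remote (remote names, timestamps and the
error text are made up; the field names follow the kebab-case serde
attributes introduced below):

  {
    "remote-status": {
      "pve-some-remote": {
        "most-recent-datapoint": 1739538360,
        "last-collection": 1739538365
      },
      "pbs-other-remote": {
        "most-recent-datapoint": 0,
        "last-collection": null,
        "error": "error trying to connect: ..."
      }
    }
  }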
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
Changes since v1:
- use .inspect_err(..).unwrap_or_default() instead of a manual match
in `MetricCollectionState::new`
.../src/metric_collection/collection_task.rs | 135 ++++++++++++------
server/src/metric_collection/mod.rs | 1 +
server/src/metric_collection/rrd_task.rs | 60 ++++++--
server/src/metric_collection/state.rs | 126 ++++++++++++++++
4 files changed, 269 insertions(+), 53 deletions(-)
create mode 100644 server/src/metric_collection/state.rs
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index b55c8e92..f985bd3e 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,16 +1,17 @@
-use std::{collections::HashMap, sync::Arc, time::Duration};
+use std::{sync::Arc, time::Duration};
use anyhow::Error;
use rand::Rng;
use tokio::{
sync::{
mpsc::{Receiver, Sender},
- OwnedSemaphorePermit, Semaphore,
+ oneshot, OwnedSemaphorePermit, Semaphore,
},
time::Interval,
};
use proxmox_section_config::typed::SectionConfigData;
+use proxmox_sys::fs::CreateOptions;
use pdm_api_types::{
remotes::{Remote, RemoteType},
@@ -20,7 +21,16 @@ use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
use crate::{connection, task_utils};
-use super::rrd_task::RrdStoreRequest;
+use super::{
+ rrd_task::{RrdStoreRequest, RrdStoreResult},
+ state::{MetricCollectionState, RemoteStatus},
+};
+
+/// Location of the metric collection state file.
+const METRIC_COLLECTION_STATE_FILE: &str = concat!(
+ pdm_buildcfg::PDM_STATE_DIR_M!(),
+ "/metric-collection-state.json"
+);
/// Control messages for the metric collection task.
pub(super) enum ControlMsg {
@@ -31,7 +41,7 @@ pub(super) enum ControlMsg {
/// Task which periodically collects metrics from all remotes and stores
/// them in the local metrics database.
pub(super) struct MetricCollectionTask {
- most_recent_timestamps: HashMap<String, i64>,
+ state: MetricCollectionState,
settings: CollectionSettings,
metric_data_tx: Sender<RrdStoreRequest>,
control_message_rx: Receiver<ControlMsg>,
@@ -44,9 +54,10 @@ impl MetricCollectionTask {
control_message_rx: Receiver<ControlMsg>,
) -> Result<Self, Error> {
let settings = Self::get_settings_or_default();
+ let state = load_state()?;
Ok(Self {
- most_recent_timestamps: HashMap::new(),
+ state,
settings,
metric_data_tx,
control_message_rx,
@@ -115,6 +126,10 @@ impl MetricCollectionTask {
);
timer = Self::setup_timer(interval);
}
+
+ if let Err(err) = self.state.save() {
+ log::error!("could not update metric collection state: {err}");
+ }
}
}
@@ -206,7 +221,11 @@ impl MetricCollectionTask {
let mut handles = Vec::new();
for remote_name in remotes_to_fetch {
- let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+ let status = self
+ .state
+ .get_status(remote_name)
+ .cloned()
+ .unwrap_or_default();
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
@@ -217,7 +236,7 @@ impl MetricCollectionTask {
let handle = tokio::spawn(Self::fetch_single_remote(
self.settings.clone(),
remote,
- start_time,
+ status,
self.metric_data_tx.clone(),
permit,
));
@@ -231,8 +250,7 @@ impl MetricCollectionTask {
match res {
Ok(Ok(ts)) => {
- self.most_recent_timestamps
- .insert(remote_name.to_string(), ts);
+ self.state.set_status(remote_name, ts);
}
Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
@@ -249,52 +267,85 @@ impl MetricCollectionTask {
async fn fetch_single_remote(
settings: CollectionSettings,
remote: Remote,
- start_time: i64,
+ mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<i64, Error> {
+ ) -> Result<RemoteStatus, Error> {
Self::sleep_for_random_millis(
settings.min_connection_delay_or_default(),
settings.max_connection_delay_or_default(),
"connection-delay",
)
.await;
+ let (result_tx, result_rx) = oneshot::channel();
+
+ let now = proxmox_time::epoch_i64();
+
+ let res: Result<RrdStoreResult, Error> = async {
+ match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(
+ Some(true),
+ Some(false),
+ Some(status.most_recent_datapoint),
+ )
+ .await?;
+
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client
+ .metrics(Some(true), Some(status.most_recent_datapoint))
+ .await?;
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
+ }
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(&remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
-
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
+ result_rx.await.map_err(Error::from)
+ }
+ .await;
- most_recent
+ match res {
+ Ok(result) => {
+ status.most_recent_datapoint = result.most_recent_timestamp;
+ status.last_collection = Some(now);
+ status.error = None;
+ }
+ Err(err) => {
+ status.error = Some(err.to_string());
+ log::error!("could not fetch metrics from '{}': {err}", remote.id);
}
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(&remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
+ }
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+ Ok(status)
+ }
+}
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
+/// Load the metric collection state file.
+pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
- most_recent
- }
- };
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
- Ok(most_recent_timestamp)
- }
+ Ok(MetricCollectionState::new(
+ METRIC_COLLECTION_STATE_FILE.into(),
+ file_options,
+ ))
}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9b203615..9cd60455 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,7 @@ use tokio::sync::mpsc::{self, Sender};
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
+mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index a72945df..1c618f54 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,8 +1,10 @@
use anyhow::Error;
-use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use tokio::sync::{mpsc::Receiver, oneshot};
+
use proxmox_rrd::rrd::DataSourceType;
+
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use tokio::sync::mpsc::Receiver;
use super::rrd_cache;
@@ -14,6 +16,8 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: ClusterMetrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
/// Store PBS metrics.
Pbs {
@@ -21,9 +25,17 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: Metrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
}
+/// Result for a [`RrdStoreRequest`].
+pub(super) struct RrdStoreResult {
+ /// Most recent timestamp of any stored metric datapoint (UNIX epoch).
+ pub(super) most_recent_timestamp: i64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -31,17 +43,43 @@ pub(super) async fn store_in_rrd_task(
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
// Involves some blocking file IO
- let res = tokio::task::spawn_blocking(move || match msg {
- RrdStoreRequest::Pve { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pve(&remote, &data_point);
+ let res = tokio::task::spawn_blocking(move || {
+ let mut most_recent_timestamp = 0;
+ let channel = match msg {
+ RrdStoreRequest::Pve {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pve(&remote, &data_point);
+ }
+
+ channel
}
- }
- RrdStoreRequest::Pbs { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pbs(&remote, &data_point);
+ RrdStoreRequest::Pbs {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pbs(&remote, &data_point);
+ }
+
+ channel
}
- }
+ };
+
+ if channel
+ .send(RrdStoreResult {
+ most_recent_timestamp,
+ })
+ .is_err()
+ {
+ log::error!("could not send RrdStoreResult to metric collection task");
+ };
})
.await;
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
new file mode 100644
index 00000000..5b04ea61
--- /dev/null
+++ b/server/src/metric_collection/state.rs
@@ -0,0 +1,126 @@
+use std::{
+ collections::HashMap,
+ path::{Path, PathBuf},
+};
+
+use anyhow::Error;
+use serde::{Deserialize, Serialize};
+
+use proxmox_sys::fs::CreateOptions;
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Metric collection state file content.
+struct State {
+ remote_status: HashMap<String, RemoteStatus>,
+}
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// A remote's metric collection state.
+pub struct RemoteStatus {
+ /// Most recent datapoint - time stamp is based on remote time
+ pub most_recent_datapoint: i64,
+ /// Last successful metric collection - timestamp based on PDM's time
+ pub last_collection: Option<i64>,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ /// Any error that occurred during the last metric collection attempt.
+ pub error: Option<String>,
+}
+
+/// Manage and persist metric collection state.
+pub struct MetricCollectionState {
+ /// Path to the persisted state
+ path: PathBuf,
+ /// File owner/perms for the persisted state file
+ file_options: CreateOptions,
+ /// The current state
+ state: State,
+}
+
+impl MetricCollectionState {
+ /// Initialize state by trying to load the existing statefile. If the file does not exist,
+ /// state will be empty. If the file fails to load, the state will be empty and
+ /// an error will be logged.
+ pub fn new(statefile: PathBuf, file_options: CreateOptions) -> Self {
+ let state = Self::load_or_default(&statefile)
+ .inspect_err(|err| {
+ log::error!("could not load metric collection state: {err}");
+ })
+ .unwrap_or_default();
+
+ Self {
+ path: statefile,
+ file_options,
+ state,
+ }
+ }
+
+ /// Set a remote's status.
+ pub fn set_status(&mut self, remote: String, remote_state: RemoteStatus) {
+ self.state.remote_status.insert(remote, remote_state);
+ }
+
+ /// Get a remote's status.
+ pub fn get_status(&self, remote: &str) -> Option<&RemoteStatus> {
+ self.state.remote_status.get(remote)
+ }
+
+ /// Persist the state to the statefile.
+ pub fn save(&self) -> Result<(), Error> {
+ let data = serde_json::to_vec_pretty(&self.state)?;
+ proxmox_sys::fs::replace_file(&self.path, &data, self.file_options.clone(), true)?;
+
+ Ok(())
+ }
+
+ fn load_or_default(path: &Path) -> Result<State, Error> {
+ let content = proxmox_sys::fs::file_read_optional_string(path)?;
+
+ if let Some(content) = content {
+ Ok(serde_json::from_str(&content)?)
+ } else {
+ Ok(Default::default())
+ }
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_sys::fs::CreateOptions;
+
+ use super::*;
+
+ use crate::test_support::temp::NamedTempFile;
+
+ fn get_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ #[test]
+ fn save_and_load() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_options())?;
+ let options = get_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+ state.set_status(
+ "some-remote".into(),
+ RemoteStatus {
+ most_recent_datapoint: 1234,
+ ..Default::default()
+ },
+ );
+
+ state.save()?;
+
+ let state = MetricCollectionState::new(file.path().into(), options);
+
+ let status = state.get_status("some-remote").unwrap();
+ assert_eq!(status.most_recent_datapoint, 1234);
+
+ Ok(())
+ }
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 09/28] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (7 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 08/28] metric collection: persist state after metric collection Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 10/28] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
` (20 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
It makes sense to limit the collection frequency per remote, especially
since we will have the ability to trigger collection manually.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/metric_collection/collection_task.rs | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f985bd3e..744a7ccc 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -15,7 +15,7 @@ use proxmox_sys::fs::CreateOptions;
use pdm_api_types::{
remotes::{Remote, RemoteType},
- CollectionSettings,
+ CollectionSettings, MIN_COLLECTION_INTERVAL,
};
use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
@@ -219,6 +219,7 @@ impl MetricCollectionTask {
self.settings.max_concurrent_connections_or_default(),
));
let mut handles = Vec::new();
+ let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
let status = self
@@ -227,6 +228,13 @@ impl MetricCollectionTask {
.cloned()
.unwrap_or_default();
+ if now - status.last_collection.unwrap_or(0) < MIN_COLLECTION_INTERVAL as i64 {
+ log::debug!(
+ "skipping metric collection for remote '{remote_name}' - data is recent enough"
+ );
+ continue;
+ }
+
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 10/28] metric collection: collect overdue metrics on startup/timer change
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (8 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 09/28] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 11/28] metric collection: add tests for the fetch_remotes function Lukas Wagner
` (19 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
Because the timer fires at aligned points in time and might not fire
right away after being set up, we could end up with gaps in the data
when the timer interval changes or the daemon starts up.
To mitigate this, on daemon startup and also if the collection interval
changes, we
- check if the time until the next scheduled regular collection
plus the time *since* the last successful collection exceeds
the configured collection interval
- if yes, we collect immediately
- if no, we do nothing and let the remote be collected at the
next timer tick
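A brief worked example (numbers chosen purely for illustration): with a
configured collection interval of 60 seconds, a remote whose last
successful collection was 45 seconds ago and whose next regular tick is
30 seconds away is overdue (45 + 30 = 75 > 60) and is collected
immediately, while a remote collected 20 seconds ago is not
(20 + 30 = 50 <= 60) and simply waits for the next tick.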
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
Changes since v1:
- Document return values of `setup_timer`
.../src/metric_collection/collection_task.rs | 84 +++++++++++++++----
1 file changed, 66 insertions(+), 18 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 744a7ccc..9467175b 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,4 +1,7 @@
-use std::{sync::Arc, time::Duration};
+use std::{
+ sync::Arc,
+ time::{Duration, Instant},
+};
use anyhow::Error;
use rand::Rng;
@@ -69,15 +72,18 @@ impl MetricCollectionTask {
/// This function never returns.
#[tracing::instrument(skip_all, name = "metric_collection_task")]
pub(super) async fn run(&mut self) {
- let mut timer = Self::setup_timer(self.settings.collection_interval_or_default());
+ let (mut timer, mut first_tick) =
+ Self::setup_timer(self.settings.collection_interval_or_default());
- log::debug!(
- "metric collection starting up. Collection interval set to {} seconds.",
- self.settings.collection_interval_or_default()
- );
+ // Check and fetch any remote which would be overdue by the time the
+ // timer first fires.
+ if let Some(remote_config) = Self::load_remote_config() {
+ self.fetch_overdue(&remote_config, first_tick).await;
+ }
loop {
- let old_settings = self.settings.clone();
+ // Remember current collection interval.
+ let current_interval = self.settings.collection_interval_or_default();
tokio::select! {
_ = timer.tick() => {
// Reload settings in case they have changed in the meanwhile
@@ -117,14 +123,17 @@ impl MetricCollectionTask {
}
}
- let interval = self.settings.collection_interval_or_default();
+ let new_interval = self.settings.collection_interval_or_default();
- if old_settings.collection_interval_or_default() != interval {
- log::info!(
- "metric collection interval changed to {} seconds, reloading timer",
- interval
- );
- timer = Self::setup_timer(interval);
+ if current_interval != new_interval {
+ (timer, first_tick) = Self::setup_timer(new_interval);
+ // If we change (and therefore reset) our timer right before it fires,
+ // we could potentially miss one collection event.
+ // Therefore fetch all remotes which would be due for metric collection before
+ // the new timer fires.
+ if let Some(remote_config) = Self::load_remote_config() {
+ self.fetch_overdue(&remote_config, first_tick).await;
+ }
}
if let Err(err) = self.state.save() {
@@ -186,12 +195,16 @@ impl MetricCollectionTask {
/// Set up a [`tokio::time::Interval`] instance with the provided interval.
/// The timer will be aligned, e.g. an interval of `60` will let the timer
/// fire at minute boundaries.
- fn setup_timer(interval: u64) -> Interval {
+ ///
+ /// The return values are a tuple of the [`tokio::time::Interval`] timer instance
+ /// and the [`std::time::Instant`] at which the timer first fires.
+ fn setup_timer(interval: u64) -> (Interval, Instant) {
+ log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
- let first_run = task_utils::next_aligned_instant(interval).into();
- timer.reset_at(first_run);
+ let first_run = task_utils::next_aligned_instant(interval);
+ timer.reset_at(first_run.into());
- timer
+ (timer, first_run)
}
/// Convenience helper to load `remote.cfg`, logging the error
@@ -270,6 +283,41 @@ impl MetricCollectionTask {
}
}
+ /// Fetch metric data from remotes which are overdue for collection.
+ ///
+ /// Use this on startup of the metric collection loop as well as
+ /// when the collection interval changes.
+ async fn fetch_overdue(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ next_run: Instant,
+ ) {
+ let left_until_scheduled = next_run - Instant::now();
+ let now = proxmox_time::epoch_i64();
+
+ let mut overdue = Vec::new();
+
+ for remote in &remote_config.order {
+ let last_collection = self
+ .state
+ .get_status(remote)
+ .and_then(|s| s.last_collection)
+ .unwrap_or(0);
+
+ let diff = now - last_collection;
+
+ if diff + left_until_scheduled.as_secs() as i64
+ > self.settings.collection_interval_or_default() as i64
+ {
+ log::debug!(
+ "starting metric collection for remote '{remote}' - triggered because collection is overdue"
+ );
+ overdue.push(remote.clone());
+ }
+ }
+ self.fetch_remotes(remote_config, &overdue).await;
+ }
+
/// Fetch a single remote.
#[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
async fn fetch_single_remote(
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 11/28] metric collection: add tests for the fetch_remotes function
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (9 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 10/28] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 12/28] metric collection: add test for fetch_overdue Lukas Wagner
` (18 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
Now that we can mock remote clients and the remote config, and since the
metric collection and RRD tasks are decoupled, it is viable to write a
unit test for one of the core pieces of the metric collection system.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 228 ++++++++++++++++++
server/src/metric_collection/state.rs | 16 +-
2 files changed, 232 insertions(+), 12 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 9467175b..2b4130fe 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -405,3 +405,231 @@ pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
file_options,
))
}
+
+#[cfg(test)]
+pub(super) mod tests {
+ use std::{collections::HashMap, sync::Once};
+
+ use anyhow::bail;
+ use http::StatusCode;
+
+ use pdm_api_types::Authid;
+ use pve_api_types::{client::PveClient, ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ connection::ClientFactory, metric_collection::rrd_task::RrdStoreResult,
+ pbs_client::PbsClient, test_support::temp::NamedTempFile,
+ };
+
+ use super::*;
+
+ pub(crate) fn get_create_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ struct TestClientFactory {
+ now: i64,
+ }
+
+ #[async_trait::async_trait]
+ impl ClientFactory for TestClientFactory {
+ fn make_pve_client(
+ &self,
+ remote: &Remote,
+ ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+ Ok(Box::new(TestPveClient {
+ fail: remote.id.contains("fail"),
+ now: self.now,
+ }))
+ }
+ /// Create a new API client for PVE remotes, but with a specific endpoint.
+ fn make_pve_client_with_endpoint(
+ &self,
+ _remote: &Remote,
+ _target_endpoint: Option<&str>,
+ ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+ bail!("not implemented")
+ }
+
+ fn make_pbs_client(&self, _remote: &Remote) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pve_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Box<dyn PveClient + Send + Sync>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pbs_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+ }
+
+ struct TestPveClient {
+ now: i64,
+ fail: bool,
+ }
+
+ #[async_trait::async_trait]
+ impl PveClient for TestPveClient {
+ /// Retrieve metrics of the cluster.
+ async fn cluster_metrics_export(
+ &self,
+ _history: Option<bool>,
+ _local_only: Option<bool>,
+ start_time: Option<i64>,
+ ) -> Result<ClusterMetrics, proxmox_client::Error> {
+ if self.fail {
+ return Err(proxmox_client::Error::Api(
+ StatusCode::INTERNAL_SERVER_ERROR,
+ "internal server error".into(),
+ ));
+ }
+
+ let mut time = start_time.unwrap_or(0);
+ time = time.max(self.now - (30 * 60));
+ let mut data = Vec::new();
+
+ use pve_api_types::ClusterMetricsDataType::*;
+
+ while time < self.now {
+ let point = |id: &str, metric: &str, timestamp, ty| ClusterMetricsData {
+ id: id.into(),
+ metric: metric.into(),
+ timestamp,
+ ty,
+ value: 0.1,
+ };
+
+ for i in 0..5 {
+ let id = format!("node/node-{i}");
+ data.push(point(&id, "cpu_current", time, Gauge));
+ }
+
+ // Advance time by 10 seconds
+ time += 10;
+ }
+
+ Ok(ClusterMetrics { data })
+ }
+ }
+
+ fn make_remote_config() -> SectionConfigData<Remote> {
+ let mut order = Vec::new();
+ let mut sections = HashMap::new();
+
+ for i in 0..4 {
+ let status = if i >= 2 { "fail" } else { "pass" };
+ let name = format!("pve-{i}-{status}");
+
+ sections.insert(
+ name.clone(),
+ Remote {
+ ty: pdm_api_types::remotes::RemoteType::Pve,
+ id: name.clone(),
+ nodes: Vec::new(),
+ authid: Authid::root_auth_id().clone(),
+ token: "".into(),
+ web_url: None,
+ },
+ );
+
+ order.push(name);
+ }
+
+ SectionConfigData { sections, order }
+ }
+
+ async fn fake_rrd_task(mut rx: Receiver<RrdStoreRequest>) -> u32 {
+ let mut number_of_requests = 0;
+
+ while let Some(request) = rx.recv().await {
+ number_of_requests += 1;
+
+ if let RrdStoreRequest::Pve {
+ metrics, channel, ..
+ } = request
+ {
+ let most_recent_timestamp =
+ metrics.data.iter().map(|e| e.timestamp).max().unwrap_or(0);
+
+ let _ = channel.send(RrdStoreResult {
+ most_recent_timestamp,
+ });
+ }
+ }
+
+ number_of_requests
+ }
+
+ static START: Once = Once::new();
+
+ fn test_init() -> i64 {
+ let now = 10000;
+ START.call_once(|| {
+ // TODO: the client factory is currently stored in a OnceLock -
+ // we can only set it from one test... Ideally we'd like to have the
+ // option to set it in every single test if needed - task/thread local?
+ connection::init(Box::new(TestClientFactory { now }));
+ });
+
+ now
+ }
+
+ #[tokio::test]
+ async fn test_fetch_remotes_updates_state() {
+ // Arrange
+ let now = test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let settings = CollectionSettings::default();
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ settings,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ // Act
+ task.fetch_remotes(&config, &config.order).await;
+
+ // Assert
+ for remote in config.order {
+ let status = task.state.get_status(&remote).unwrap();
+
+ // Our faked PVE client will return an error if the remote name contains
+ // 'fail'.
+ if remote.contains("fail") {
+ assert!(status
+ .error
+ .as_ref()
+ .unwrap()
+ .contains("internal server error"));
+ assert_eq!(status.last_collection, None);
+ } else {
+ assert!(now - status.most_recent_datapoint <= 10);
+ assert!(status.error.is_none());
+ }
+ }
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 2);
+ }
+}
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index 5b04ea61..86375a65 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -87,23 +87,15 @@ impl MetricCollectionState {
#[cfg(test)]
mod tests {
- use proxmox_sys::fs::CreateOptions;
-
- use super::*;
-
+ use crate::metric_collection::collection_task::tests::get_create_options;
use crate::test_support::temp::NamedTempFile;
- fn get_options() -> CreateOptions {
- CreateOptions::new()
- .owner(nix::unistd::Uid::effective())
- .group(nix::unistd::Gid::effective())
- .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
- }
+ use super::*;
#[test]
fn save_and_load() -> Result<(), Error> {
- let file = NamedTempFile::new(get_options())?;
- let options = get_options();
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
let mut state = MetricCollectionState::new(file.path().into(), options.clone());
state.set_status(
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 12/28] metric collection: add test for fetch_overdue
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (10 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 11/28] metric collection: add tests for the fetch_remotes function Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 13/28] metric collection: pass rrd cache instance as function parameter Lukas Wagner
` (17 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This test ensures that the logic for fetching metrics from remotes which
are overdue for collection continues to work as expected.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 62 +++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 2b4130fe..f1027ccd 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -632,4 +632,66 @@ pub(super) mod tests {
drop(task);
assert_eq!(handle.await.unwrap(), 2);
}
+
+ #[tokio::test]
+ async fn test_fetch_overdue() {
+ // Arrange
+ test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let settings = CollectionSettings {
+ collection_interval: Some(60),
+ ..Default::default()
+ };
+
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let mut state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let now = proxmox_time::epoch_i64();
+
+ // This one should be fetched
+ state.set_status(
+ "pve-0-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 35),
+ ..Default::default()
+ },
+ );
+ // This one should *not* be fetched
+ state.set_status(
+ "pve-1-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 25),
+ ..Default::default()
+ },
+ );
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ settings,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ let next_collection = Instant::now() + Duration::from_secs(30);
+
+ // Act
+ task.fetch_overdue(&config, next_collection).await;
+
+ // Assert
+ let status = task.state.get_status("pve-0-pass").unwrap();
+ assert!(status.last_collection.unwrap() - now >= 0);
+
+ let status = task.state.get_status("pve-1-pass").unwrap();
+ assert_eq!(status.last_collection.unwrap(), now - 25);
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 1);
+ }
}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 13/28] metric collection: pass rrd cache instance as function parameter
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (11 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 12/28] metric collection: add test for fetch_overdue Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 14/28] metric collection: add test for rrd task Lukas Wagner
` (16 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This enables us to do dependency injection for tests by passing in the
rrd cache to use, instead of having to replace the global static
instance. The latter is always awkward: tests might run multi-threaded,
so the global instance could/should be a thread-local, but that is weird
in the tokio world, where we actually want task-local variables - which
in turn are weird once you spawn new tasks, since those don't see the
same task-local variables as their parent task. Long story short,
passing in the dependency as a parameter makes things easier.
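As a rough sketch of what this enables, a test can now build a private
cache and hand it to the RRD task instead of touching the global
instance. This mirrors the rrd task test added later in this series, so
the helpers NamedTempDir and get_create_options are borrowed from there
and are not part of this patch:

  let dir = NamedTempDir::new()?;
  let options =
      get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
  // construct a private cache instance just for this test
  let cache = rrd_cache::init(&dir.path(), options.clone(), options)?;

  // the RRD task now receives its cache as an explicit parameter;
  // tx is then used to feed RrdStoreRequests into it
  let (tx, rx) = tokio::sync::mpsc::channel(10);
  let handle = tokio::task::spawn(store_in_rrd_task(Arc::clone(&cache), rx));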
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/api/rrd_common.rs | 4 +-
server/src/metric_collection/mod.rs | 15 +++-
server/src/metric_collection/rrd_cache.rs | 77 ++++++++++++--------
server/src/metric_collection/rrd_task.rs | 16 ++--
server/src/metric_collection/top_entities.rs | 3 +
5 files changed, 75 insertions(+), 40 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index 0d82d0c3..d9ed017a 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -23,9 +23,11 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let mut timemap = BTreeMap::new();
let mut last_resolution = None;
+ let cache = rrd_cache::get_cache();
+
for name in T::fields() {
let (start, resolution, data) =
- match rrd_cache::extract_data(basedir, name, timeframe, mode)? {
+ match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
Some(data) => data.into(),
None => continue,
};
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9cd60455..509d4f88 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -4,6 +4,8 @@ use std::sync::OnceLock;
use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
+use proxmox_sys::fs::CreateOptions;
+
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
@@ -16,7 +18,14 @@ static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
- rrd_cache::init()?;
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
+
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+ let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+ let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(cache)?;
Ok(())
}
@@ -42,8 +51,10 @@ pub fn start_task() -> Result<(), Error> {
futures::future::select(metric_collection_task_future, abort_future).await;
});
+ let cache = rrd_cache::get_cache();
+
tokio::spawn(async move {
- let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(cache, metric_data_rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(rrd_task_future, abort_future).await;
});
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index a8d72b80..70c91ca6 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -5,6 +5,7 @@
//! and update RRD data inside `proxmox-datacenter-api`.
use std::path::Path;
+use std::sync::Arc;
use anyhow::{format_err, Error};
use once_cell::sync::OnceCell;
@@ -16,32 +17,45 @@ use proxmox_sys::fs::CreateOptions;
use pdm_buildcfg::PDM_STATE_DIR_M;
-const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
+pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
-static RRD_CACHE: OnceCell<Cache> = OnceCell::new();
+// This is an `Arc` because this makes it easier to do dependency injection
+// in test contexts.
+//
+// For DI in testing, we want to pass in a reference to the Cache
+// as a function parameter. In a couple of these functions we
+// spawn tokio tasks which need access to the reference, hence the
+// reference needs to be 'static. In a test context, we kind of have a
+// hard time to come up with a 'static reference, so we just
+// wrap the cache in an `Arc` for now, solving the
+// lifetime problem via refcounting.
+static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
/// Get the RRD cache instance
-fn get_cache() -> Result<&'static Cache, Error> {
+pub fn get_cache() -> Arc<Cache> {
+ RRD_CACHE.get().cloned().expect("rrd cache not initialized")
+}
+
+pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
RRD_CACHE
- .get()
- .ok_or_else(|| format_err!("RRD cache not initialized!"))
+ .set(cache)
+ .map_err(|_| format_err!("RRD cache already initialized!"))?;
+
+ Ok(())
}
/// Initialize the RRD cache instance
///
/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init() -> Result<&'static Cache, Error> {
- let api_uid = pdm_config::api_user()?.uid;
- let api_gid = pdm_config::api_group()?.gid;
-
- let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
- let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
+pub fn init<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+) -> Result<Arc<Cache>, Error> {
let apply_interval = 30.0 * 60.0; // 30 minutes
let cache = Cache::new(
- RRD_CACHE_BASEDIR,
+ base_path,
Some(file_options),
Some(dir_options),
apply_interval,
@@ -51,11 +65,7 @@ pub fn init() -> Result<&'static Cache, Error> {
cache.apply_journal()?;
- RRD_CACHE
- .set(cache)
- .map_err(|_| format_err!("RRD cache already initialized!"))?;
-
- Ok(RRD_CACHE.get().unwrap())
+ Ok(Arc::new(cache))
}
fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
@@ -91,6 +101,7 @@ fn create_callback(dst: DataSourceType) -> Database {
/// Extracts data for the specified time frame from RRD cache
pub fn extract_data(
+ rrd_cache: &Cache,
basedir: &str,
name: &str,
timeframe: RrdTimeframe,
@@ -112,26 +123,28 @@ pub fn extract_data(
RrdMode::Average => AggregationFn::Average,
};
- let rrd_cache = get_cache()?;
-
rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
}
/// Sync/Flush the RRD journal
pub fn sync_journal() {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
- }
+ let rrd_cache = get_cache();
+ if let Err(err) = rrd_cache.sync_journal() {
+ log::error!("rrd_sync_journal failed - {err}");
}
}
+
/// Update RRD Gauge values
-pub fn update_value(name: &str, value: f64, timestamp: i64, datasource_type: DataSourceType) {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
- }
+pub fn update_value(
+ rrd_cache: &Cache,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+) {
+ if let Err(err) =
+ rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 1c618f54..3704b0e7 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,7 +1,9 @@
+use std::sync::Arc;
+
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::rrd::DataSourceType;
+use proxmox_rrd::{rrd::DataSourceType, Cache};
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
@@ -39,9 +41,11 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
+ cache: Arc<Cache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
+ let cache_clone = Arc::clone(&cache);
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
@@ -53,7 +57,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote, &data_point);
+ store_metric_pve(&cache_clone, &remote, &data_point);
}
channel
@@ -65,7 +69,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote, &data_point);
+ store_metric_pbs(&cache_clone, &remote, &data_point);
}
channel
@@ -91,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -105,6 +109,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
@@ -112,7 +117,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
);
}
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -126,6 +131,7 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f8e053fb..f54ee72f 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -121,7 +121,10 @@ fn get_entity(
name: String,
metric: &str,
) -> Option<(f64, TopEntity)> {
+ let cache = rrd_cache::get_cache();
+
if let Ok(Some(values)) = rrd_cache::extract_data(
+ &cache,
&name,
metric,
timeframe,
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 14/28] metric collection: add test for rrd task
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (12 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 13/28] metric collection: pass rrd cache instance as function parameter Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 15/28] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
` (15 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This commit adds a quick smoke-test for the `store_in_rrd_task`
function. Unfortunately, due to a race condition in proxmox_rrd, some
extra measures had to be taken to keep the test from being flaky.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/metric_collection/rrd_task.rs | 92 ++++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 3704b0e7..27327a29 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -138,3 +138,95 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
data_source_type,
);
}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
+ use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ metric_collection::collection_task::tests::get_create_options,
+ test_support::temp::NamedTempDir,
+ };
+
+ use super::*;
+
+ #[tokio::test]
+ async fn test_rrd_task_persists_data() -> Result<(), Error> {
+ // Arrange
+ let dir = NamedTempDir::new()?;
+ let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
+ let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let task = store_in_rrd_task(Arc::clone(&cache), rx);
+ let handle = tokio::task::spawn(task);
+
+ let now = proxmox_time::epoch_i64();
+
+ let metrics = ClusterMetrics {
+ data: vec![
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 30,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 20,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 10,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ],
+ };
+ let (tx_back, rx_back) = tokio::sync::oneshot::channel();
+ let request = RrdStoreRequest::Pve {
+ remote: "some-remote".into(),
+ metrics,
+ channel: tx_back,
+ };
+
+ // Act
+ tx.send(request).await?;
+ let result = rx_back.await?;
+
+ // Assert
+ assert_eq!(result.most_recent_timestamp, now);
+
+ drop(tx);
+ handle.await??;
+
+ // There is some race condition in proxmox_rrd, in some rare cases
+ // extract_data does not return any data directly after writing.
+ if let Some(data) = rrd_cache::extract_data(
+ &cache,
+ "pve/some-remote/node/some-node",
+ "cpu_current",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ // Only assert that there are some data points, the exact position in the vec
+ // might vary due to changed boundaries.
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
+ Ok(())
+ }
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 15/28] metric collection: wrap rrd_cache::Cache in a struct
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (13 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 14/28] metric collection: add test for rrd task Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 16/28] metric collection: record remote response time in metric database Lukas Wagner
` (14 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
All helper functions for storing/retrieving RRD data take the cache
instance as their first parameter. This smells like they should be methods
on a struct.
This commit wraps the foreign `proxmox_rrd::Cache` type in a new type,
turns the `init` function into a `new` method, and turns all other helpers
into methods as well.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
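For reviewers, a short sketch of the resulting call pattern (type and method
names as introduced below; the path, value and CreateOptions are placeholders):

    let opts = CreateOptions::new();
    let cache = RrdCache::new("/tmp/rrdb-example", opts.clone(), opts)?;

    cache.update_value(
        "pve/some-remote/node/node1/cpu_current",
        0.2,
        proxmox_time::epoch_i64(),
        DataSourceType::Gauge,
    );

    let _entry = cache.extract_data(
        "pve/some-remote/node/node1",
        "cpu_current",
        RrdTimeframe::Hour,
        RrdMode::Average,
    )?;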
server/src/api/rrd_common.rs | 9 +-
server/src/metric_collection/mod.rs | 6 +-
server/src/metric_collection/rrd_cache.rs | 191 +++++++++----------
server/src/metric_collection/rrd_task.rs | 21 +-
server/src/metric_collection/top_entities.rs | 3 +-
5 files changed, 113 insertions(+), 117 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index d9ed017a..28868bc1 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -26,11 +26,10 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let cache = rrd_cache::get_cache();
for name in T::fields() {
- let (start, resolution, data) =
- match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
- Some(data) => data.into(),
- None => continue,
- };
+ let (start, resolution, data) = match cache.extract_data(basedir, name, timeframe, mode)? {
+ Some(data) => data.into(),
+ None => continue,
+ };
if let Some(expected_resolution) = last_resolution {
if resolution != expected_resolution {
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 509d4f88..5b6c14d2 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,4 +1,5 @@
use std::pin::pin;
+use std::sync::Arc;
use std::sync::OnceLock;
use anyhow::{bail, Error};
@@ -13,6 +14,7 @@ mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
+use rrd_cache::RrdCache;
static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
@@ -24,8 +26,8 @@ pub fn init() -> Result<(), Error> {
let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
- let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
- rrd_cache::set_cache(cache)?;
+ let cache = RrdCache::new(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(Arc::new(cache))?;
Ok(())
}
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index 70c91ca6..e8994143 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -29,14 +29,14 @@ pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
// hard time to come up with a 'static reference, so we just
// wrap the cache in an `Arc` for now, solving the
// lifetime problem via refcounting.
-static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
+static RRD_CACHE: OnceCell<Arc<RrdCache>> = OnceCell::new();
/// Get the RRD cache instance
-pub fn get_cache() -> Arc<Cache> {
+pub fn get_cache() -> Arc<RrdCache> {
RRD_CACHE.get().cloned().expect("rrd cache not initialized")
}
-pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
+pub fn set_cache(cache: Arc<RrdCache>) -> Result<(), Error> {
RRD_CACHE
.set(cache)
.map_err(|_| format_err!("RRD cache already initialized!"))?;
@@ -44,107 +44,106 @@ pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
Ok(())
}
-/// Initialize the RRD cache instance
-///
-/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init<P: AsRef<Path>>(
- base_path: P,
- dir_options: CreateOptions,
- file_options: CreateOptions,
-) -> Result<Arc<Cache>, Error> {
- let apply_interval = 30.0 * 60.0; // 30 minutes
-
- let cache = Cache::new(
- base_path,
- Some(file_options),
- Some(dir_options),
- apply_interval,
- load_callback,
- create_callback,
- )?;
-
- cache.apply_journal()?;
-
- Ok(Arc::new(cache))
+/// Wrapper for proxmox_rrd::Cache to accommodate helper methods.
+pub struct RrdCache {
+ cache: Cache,
}
-fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
- match Database::load(path, true) {
- Ok(rrd) => Some(rrd),
- Err(err) => {
- if err.kind() != std::io::ErrorKind::NotFound {
- log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+impl RrdCache {
+ /// Create a new RrdCache instance
+ pub fn new<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+ ) -> Result<Self, Error> {
+ let apply_interval = 30.0 * 60.0; // 30 minutes
+
+ let cache = Cache::new(
+ base_path,
+ Some(file_options),
+ Some(dir_options),
+ apply_interval,
+ Self::load_callback,
+ Self::create_callback,
+ )?;
+
+ cache.apply_journal()?;
+
+ Ok(Self { cache })
+ }
+
+ fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
+ match Database::load(path, true) {
+ Ok(rrd) => Some(rrd),
+ Err(err) => {
+ if err.kind() != std::io::ErrorKind::NotFound {
+ log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+ }
+ None
}
- None
}
}
-}
-
-fn create_callback(dst: DataSourceType) -> Database {
- let rra_list = vec![
- // 1 min * 1440 => 1 day
- Archive::new(AggregationFn::Average, 60, 1440),
- Archive::new(AggregationFn::Maximum, 60, 1440),
- // 30 min * 1440 => 30 days ~ 1 month
- Archive::new(AggregationFn::Average, 30 * 60, 1440),
- Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
- // 6 h * 1440 => 360 days ~ 1 year
- Archive::new(AggregationFn::Average, 6 * 3600, 1440),
- Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
- // 1 week * 570 => 10 years
- Archive::new(AggregationFn::Average, 7 * 86400, 570),
- Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
- ];
-
- Database::new(dst, rra_list)
-}
-/// Extracts data for the specified time frame from RRD cache
-pub fn extract_data(
- rrd_cache: &Cache,
- basedir: &str,
- name: &str,
- timeframe: RrdTimeframe,
- mode: RrdMode,
-) -> Result<Option<proxmox_rrd::Entry>, Error> {
- let end = proxmox_time::epoch_f64() as u64;
-
- let (start, resolution) = match timeframe {
- RrdTimeframe::Hour => (end - 3600, 60),
- RrdTimeframe::Day => (end - 3600 * 24, 60),
- RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
- RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
- RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
- RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
- };
-
- let cf = match mode {
- RrdMode::Max => AggregationFn::Maximum,
- RrdMode::Average => AggregationFn::Average,
- };
-
- rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
-}
+ fn create_callback(dst: DataSourceType) -> Database {
+ let rra_list = vec![
+ // 1 min * 1440 => 1 day
+ Archive::new(AggregationFn::Average, 60, 1440),
+ Archive::new(AggregationFn::Maximum, 60, 1440),
+ // 30 min * 1440 => 30 days ~ 1 month
+ Archive::new(AggregationFn::Average, 30 * 60, 1440),
+ Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
+ // 6 h * 1440 => 360 days ~ 1 year
+ Archive::new(AggregationFn::Average, 6 * 3600, 1440),
+ Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
+ // 1 week * 570 => 10 years
+ Archive::new(AggregationFn::Average, 7 * 86400, 570),
+ Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
+ ];
+
+ Database::new(dst, rra_list)
+ }
-/// Sync/Flush the RRD journal
-pub fn sync_journal() {
- let rrd_cache = get_cache();
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
+ /// Extracts data for the specified time frame from RRD cache
+ pub fn extract_data(
+ &self,
+ basedir: &str,
+ name: &str,
+ timeframe: RrdTimeframe,
+ mode: RrdMode,
+ ) -> Result<Option<proxmox_rrd::Entry>, Error> {
+ let end = proxmox_time::epoch_f64() as u64;
+
+ let (start, resolution) = match timeframe {
+ RrdTimeframe::Hour => (end - 3600, 60),
+ RrdTimeframe::Day => (end - 3600 * 24, 60),
+ RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
+ RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
+ RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
+ RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
+ };
+
+ let cf = match mode {
+ RrdMode::Max => AggregationFn::Maximum,
+ RrdMode::Average => AggregationFn::Average,
+ };
+
+ self.cache
+ .extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
}
-}
-/// Update RRD Gauge values
-pub fn update_value(
- rrd_cache: &Cache,
- name: &str,
- value: f64,
- timestamp: i64,
- datasource_type: DataSourceType,
-) {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
+ /// Update RRD Gauge values
+ pub fn update_value(
+ &self,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+ ) {
+ if let Err(err) =
+ self.cache
+ .update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
+ }
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 27327a29..c2a41d30 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -3,12 +3,12 @@ use std::sync::Arc;
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::{rrd::DataSourceType, Cache};
+use proxmox_rrd::rrd::DataSourceType;
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use super::rrd_cache;
+use super::rrd_cache::RrdCache;
/// Store request for the RRD task.
pub(super) enum RrdStoreRequest {
@@ -41,7 +41,7 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
- cache: Arc<Cache>,
+ cache: Arc<RrdCache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
@@ -95,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &RrdCache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -108,8 +108,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
ClusterMetricsDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -117,7 +116,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
);
}
-fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -130,8 +129,7 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
MetricDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -156,7 +154,7 @@ mod tests {
// Arrange
let dir = NamedTempDir::new()?;
let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
- let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+ let cache = Arc::new(RrdCache::new(dir.path(), options.clone(), options.clone())?);
let (tx, rx) = tokio::sync::mpsc::channel(10);
let task = store_in_rrd_task(Arc::clone(&cache), rx);
@@ -215,8 +213,7 @@ mod tests {
// There is some race condition in proxmox_rrd, in some rare cases
// extract_data does not return any data directly after writing.
- if let Some(data) = rrd_cache::extract_data(
- &cache,
+ if let Some(data) = cache.extract_data(
"pve/some-remote/node/some-node",
"cpu_current",
RrdTimeframe::Hour,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f54ee72f..31e36c34 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -123,8 +123,7 @@ fn get_entity(
) -> Option<(f64, TopEntity)> {
let cache = rrd_cache::get_cache();
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &cache,
+ if let Ok(Some(values)) = cache.extract_data(
&name,
metric,
timeframe,
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 16/28] metric collection: record remote response time in metric database
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (14 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 15/28] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 17/28] metric collection: save time needed for collection run to RRD Lukas Wagner
` (13 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This gives us the ability to retrieve max/avg response times for a given
time window.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
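In short, the response time is measured around the metrics request and handed
to the RRD task together with the data (sketch, remote name is a placeholder):

    let start = Instant::now();
    // ... perform the metrics API request against the remote ...
    let response_time = start.elapsed().as_secs_f64() * 1000.0; // ms, until as_millis_f64 is stable

    // The RRD task then stores it as a Gauge under:
    //   local/metric-collection/remotes/<remote>/response-time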
.../src/metric_collection/collection_task.rs | 13 +++++++-
server/src/metric_collection/rrd_task.rs | 31 +++++++++++++++++++
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f1027ccd..b58bf2fc 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -336,6 +336,7 @@ impl MetricCollectionTask {
let (result_tx, result_rx) = oneshot::channel();
let now = proxmox_time::epoch_i64();
+ let start = Instant::now();
let res: Result<RrdStoreResult, Error> = async {
match remote.ty {
@@ -349,11 +350,16 @@ impl MetricCollectionTask {
)
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pve {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
@@ -363,15 +369,20 @@ impl MetricCollectionTask {
.metrics(Some(true), Some(status.most_recent_datapoint))
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pbs {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
- }
+ };
result_rx.await.map_err(Error::from)
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index c2a41d30..a8e48e89 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -20,6 +20,10 @@ pub(super) enum RrdStoreRequest {
metrics: ClusterMetrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
/// Store PBS metrics.
Pbs {
@@ -29,6 +33,10 @@ pub(super) enum RrdStoreRequest {
metrics: Metrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
}
@@ -54,11 +62,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pve(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -66,11 +77,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pbs(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -137,6 +151,12 @@ fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricData
);
}
+fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64, timestamp: i64) {
+ let name = format!("local/metric-collection/remotes/{remote_name}/response-time");
+
+ cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
@@ -199,6 +219,8 @@ mod tests {
remote: "some-remote".into(),
metrics,
channel: tx_back,
+ response_time: 10.0,
+ request_at: now,
};
// Act
@@ -224,6 +246,15 @@ mod tests {
assert!(data.data.iter().any(Option::is_some));
}
+ if let Some(data) = cache.extract_data(
+ "local/metric-collection/remotes/some-remote",
+ "response-time",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
Ok(())
}
}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 17/28] metric collection: save time needed for collection run to RRD
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (15 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 16/28] metric collection: record remote response time in metric database Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 18/28] metric collection: periodically clean removed remotes from statefile Lukas Wagner
` (12 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
For large setups, it might be useful to know how much time was needed to
collect metrics for *all* remotes together, e.g. to make sure that the
collection interval is not exceeded.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
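Sketch of what gets recorded per collection run (mirrors the changes below):

    let now = Instant::now();
    self.fetch_remotes(&remotes, to_fetch).await;
    let total_time = now.elapsed().as_secs_f64() * 1000.0; // ms for *all* remotes combined

    // Sent to the RRD task as RrdStoreRequest::CollectionStats and stored as a
    // Gauge under: local/metric-collection/total-time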
.../src/metric_collection/collection_task.rs | 14 +++++
server/src/metric_collection/rrd_task.rs | 53 ++++++++++++++-----
2 files changed, 55 insertions(+), 12 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index b58bf2fc..60c62c87 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -22,6 +22,7 @@ use pdm_api_types::{
};
use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
+use crate::metric_collection::rrd_task::CollectionStats;
use crate::{connection, task_utils};
use super::{
@@ -97,8 +98,21 @@ impl MetricCollectionTask {
).await;
if let Some(remotes) = Self::load_remote_config() {
+ let now = Instant::now();
let to_fetch = remotes.order.as_slice();
self.fetch_remotes(&remotes, to_fetch).await;
+ let elapsed = now.elapsed();
+
+ if let Err(err) = self.metric_data_tx.send(
+ RrdStoreRequest::CollectionStats {
+ timestamp: proxmox_time::epoch_i64(),
+ stats: CollectionStats {
+ // TODO: use as_millis_f64 once stabilized
+ total_time: elapsed.as_secs_f64() * 1000.
+ }
+ }).await {
+ log::error!("could not send collection stats to rrd task: {err}");
+ }
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index a8e48e89..7d0b95b2 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -38,6 +38,13 @@ pub(super) enum RrdStoreRequest {
/// Timestamp at which the request was done (UNIX epoch).
request_at: i64,
},
+ /// Store collection stats.
+ CollectionStats {
+ /// Timestamp at which the collection took place (UNIX epoch).
+ timestamp: i64,
+ /// Statistics.
+ stats: CollectionStats,
+ },
}
/// Result for a [`RrdStoreRequest`].
@@ -46,6 +53,12 @@ pub(super) struct RrdStoreResult {
pub(super) most_recent_timestamp: i64,
}
+/// Statistics for a (full) metric collection run.
+pub(super) struct CollectionStats {
+ /// Total time in ms.
+ pub(super) total_time: f64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -57,7 +70,8 @@ pub(super) async fn store_in_rrd_task(
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
- let channel = match msg {
+
+ match msg {
RrdStoreRequest::Pve {
remote,
metrics,
@@ -71,7 +85,13 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
+
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
}
RrdStoreRequest::Pbs {
remote,
@@ -86,17 +106,17 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
- }
- };
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
- if channel
- .send(RrdStoreResult {
- most_recent_timestamp,
- })
- .is_err()
- {
- log::error!("could not send RrdStoreStoreResult to metric collection task");
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
+ }
+ RrdStoreRequest::CollectionStats { timestamp, stats } => {
+ store_stats(&cache_clone, &stats, timestamp)
+ }
};
})
.await;
@@ -157,6 +177,15 @@ fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64,
cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
}
+fn store_stats(cache: &RrdCache, stats: &CollectionStats, timestamp: i64) {
+ cache.update_value(
+ "local/metric-collection/total-time",
+ stats.total_time,
+ timestamp,
+ DataSourceType::Gauge,
+ );
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 18/28] metric collection: periodically clean removed remotes from statefile
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (16 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 17/28] metric collection: save time needed for collection run to RRD Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 19/28] api: add endpoint for updating metric collection settings Lukas Wagner
` (11 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
Adding and removing remotes can leave leftover data in the statefile,
hence it makes sense to clean it up periodically.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
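The cleanup hooks into the existing collection tick, roughly as follows
(sketch of the flow after this patch):

    if let Some(remotes) = Self::load_remote_config() {
        // drops state entries for remotes that are no longer in the remote config
        self.cleanup_removed_remotes_from_state(&remotes);

        // ... fetch metrics for the remaining remotes as before ...
    }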
.../src/metric_collection/collection_task.rs | 6 ++++
server/src/metric_collection/state.rs | 32 +++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 60c62c87..2ea735c5 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -98,6 +98,8 @@ impl MetricCollectionTask {
).await;
if let Some(remotes) = Self::load_remote_config() {
+ self.cleanup_removed_remotes_from_state(&remotes);
+
let now = Instant::now();
let to_fetch = remotes.order.as_slice();
self.fetch_remotes(&remotes, to_fetch).await;
@@ -177,6 +179,10 @@ impl MetricCollectionTask {
tokio::time::sleep(Duration::from_millis(jitter)).await;
}
+ fn cleanup_removed_remotes_from_state(&mut self, remotes: &SectionConfigData<Remote>) {
+ self.state.retain(|remote| remotes.get(remote).is_some());
+ }
+
fn get_settings_or_default() -> CollectionSettings {
// We want to fall back to defaults if
// - the config file does not exist (no error should be logged)
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index 86375a65..a8d49e5d 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -74,6 +74,11 @@ impl MetricCollectionState {
Ok(())
}
+ /// Retain all remotes for which the predicate holds.
+ pub fn retain(&mut self, check: impl Fn(&str) -> bool) {
+ self.state.remote_status.retain(|remote, _| check(remote));
+ }
+
fn load_or_default(path: &Path) -> Result<State, Error> {
let content = proxmox_sys::fs::file_read_optional_string(path)?;
@@ -115,4 +120,31 @@ mod tests {
Ok(())
}
+
+ #[test]
+ fn test_retain() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+ state.set_status(
+ "remote-1".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+ state.set_status(
+ "remote-2".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+
+ state.retain(|remote| remote == "remote-1");
+
+ assert!(state.get_status("remote-1").is_some());
+ assert!(state.get_status("remote-2").is_none());
+
+ Ok(())
+ }
}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 19/28] api: add endpoint for updating metric collection settings
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (17 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 18/28] metric collection: periodically clean removed remotes from statefile Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 20/28] api: add endpoint to trigger metric collection Lukas Wagner
` (10 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This one lives under /config/metric-collection.
At the moment, we do not offer full CRUD, but only a hard-coded
collection settings instance at /config/metric-collection/default, which
can be retrieved via GET and updated via PUT.
Later, we might extend this to full CRUD, e.g. to have different
settings for different 'poll-groups', each consisting of a subset of remotes.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
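A short worked example of the update semantics implemented below (values are
made up): deleted properties are reset first, then the updater values are
applied; min/max pairs are only cross-checked if one side of the pair was
touched by the request.

    existing config:  min-connection-delay 500, max-connection-delay 100  (already inconsistent)

    PUT collection-interval=600   -> accepted (delay pair untouched)
    PUT min-connection-delay=50   -> pair re-checked, 50 <= 100, accepted
    PUT min-connection-delay=200  -> rejected: min-connection-delay must not be
                                     larger than max-connection-delay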
server/src/api/config/metric_collection.rs | 156 +++++++++++++++++++++
server/src/api/config/mod.rs | 2 +
2 files changed, 158 insertions(+)
create mode 100644 server/src/api/config/metric_collection.rs
diff --git a/server/src/api/config/metric_collection.rs b/server/src/api/config/metric_collection.rs
new file mode 100644
index 00000000..0915179d
--- /dev/null
+++ b/server/src/api/config/metric_collection.rs
@@ -0,0 +1,156 @@
+use anyhow::{bail, Error};
+
+use proxmox_config_digest::ConfigDigest;
+use proxmox_router::{list_subdirs_api_method, Router, RpcEnvironment, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+use pdm_api_types::{
+ CollectionSettings, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
+};
+use pdm_config::metric_collection::COLLECTION_SETTINGS_TYPE;
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([("default", &DEFAULT_ROUTER),]);
+
+pub const ROUTER: Router = Router::new()
+ .get(&list_subdirs_api_method!(SUBDIRS))
+ .subdirs(SUBDIRS);
+
+const DEFAULT_ROUTER: Router = Router::new()
+ .get(&API_METHOD_GET_SETTINGS)
+ .put(&API_METHOD_UPDATE_SETTINGS);
+
+#[api]
+/// Get metric collection settings.
+pub fn get_settings(rpcenv: &mut dyn RpcEnvironment) -> Result<CollectionSettings, Error> {
+ let (config, digest) = pdm_config::metric_collection::config()?;
+ rpcenv["digest"] = digest.to_hex().into();
+
+ if config.sections.contains_key("default") {
+ config.lookup(COLLECTION_SETTINGS_TYPE, "default")
+ } else {
+ Ok(CollectionSettings::new("default"))
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ updater: {
+ type: CollectionSettingsUpdater,
+ },
+ delete: {
+ type: Array,
+ items: {
+ type: DeletableCollectionSettingsProperty,
+ },
+ description: "Array of properties that should be deleted.",
+ optional: true,
+ },
+ digest: {
+ type: ConfigDigest,
+ optional: true,
+ },
+ },
+ },
+)]
+/// Update metric collection settings.
+pub fn update_settings(
+ updater: CollectionSettingsUpdater,
+ delete: Option<Vec<DeletableCollectionSettingsProperty>>,
+ digest: Option<ConfigDigest>,
+) -> Result<(), Error> {
+ let (mut config, existing_digest) = pdm_config::metric_collection::config()?;
+ if digest.is_some() {
+ existing_digest.detect_modification(digest.as_ref())?;
+ }
+
+ let mut entry = if config.sections.contains_key("default") {
+ config.lookup(COLLECTION_SETTINGS_TYPE, "default")?
+ } else {
+ CollectionSettings::new("default")
+ };
+
+ let mut interval_offset_changed = false;
+ let mut connection_delay_changed = false;
+
+ if let Some(delete) = delete {
+ for property in delete {
+ use DeletableCollectionSettingsProperty::*;
+
+ match property {
+ MaxIntervalOffset => {
+ entry.max_interval_offset = None;
+ interval_offset_changed = true;
+ }
+ MinIntervalOffset => {
+ entry.min_interval_offset = None;
+ interval_offset_changed = true;
+ }
+ MaxConnectionDelay => {
+ entry.max_connection_delay = None;
+ connection_delay_changed = true;
+ }
+ MinConnectionDelay => {
+ entry.min_connection_delay = None;
+ connection_delay_changed = true;
+ }
+ MaxConcurrentConnections => entry.max_concurrent_connections = None,
+ CollectionInterval => entry.collection_interval = None,
+ }
+ }
+ }
+
+ let CollectionSettingsUpdater {
+ collection_interval,
+ max_concurrent_connections,
+ max_interval_offset,
+ min_interval_offset,
+ max_connection_delay,
+ min_connection_delay,
+ } = updater;
+
+ if let Some(value) = collection_interval {
+ entry.collection_interval = Some(value);
+ }
+ if let Some(value) = max_concurrent_connections {
+ entry.max_concurrent_connections = Some(value);
+ }
+ if let Some(value) = max_interval_offset {
+ entry.max_interval_offset = Some(value);
+ interval_offset_changed = true;
+ }
+ if let Some(value) = min_interval_offset {
+ entry.min_interval_offset = Some(value);
+ interval_offset_changed = true;
+ }
+ if let Some(value) = max_connection_delay {
+ entry.max_connection_delay = Some(value);
+ connection_delay_changed = true;
+ }
+ if let Some(value) = min_connection_delay {
+ entry.min_connection_delay = Some(value);
+ connection_delay_changed = true;
+ }
+
+ // Ensure that the minimum value is <= the maximum value - but only if we have
+ // actually changed the value. We do not want to fail if the config already contains
+ // something invalid and we are changing *other* settings.
+ if entry.min_interval_offset_or_default() > entry.max_interval_offset_or_default()
+ && interval_offset_changed
+ {
+ bail!("min-interval-offset must not be larger than max-interval-offset");
+ }
+
+ if (entry.min_connection_delay_or_default() > entry.max_connection_delay_or_default())
+ && connection_delay_changed
+ {
+ bail!("min-connection-delay must not be larger than max-connection-delay");
+ }
+
+ config.set_data("default", COLLECTION_SETTINGS_TYPE, entry)?;
+ pdm_config::metric_collection::save_config(&config)?;
+
+ Ok(())
+}
diff --git a/server/src/api/config/mod.rs b/server/src/api/config/mod.rs
index 7b58c756..15dc42c8 100644
--- a/server/src/api/config/mod.rs
+++ b/server/src/api/config/mod.rs
@@ -5,6 +5,7 @@ use proxmox_sortable_macro::sortable;
pub mod access;
pub mod acme;
pub mod certificate;
+pub mod metric_collection;
pub mod notes;
#[sortable]
@@ -12,6 +13,7 @@ const SUBDIRS: SubdirMap = &sorted!([
("access", &access::ROUTER),
("acme", &acme::ROUTER),
("certificate", &certificate::ROUTER),
+ ("metric-collection", &metric_collection::ROUTER),
("notes", ¬es::ROUTER),
]);
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 20/28] api: add endpoint to trigger metric collection
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (18 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 19/28] api: add endpoint for updating metric collection settings Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 21/28] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
` (9 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This one lives under /metric-collection/trigger.
A POST request triggers metric collection for the remote given in the
optional 'remote' parameter, or for all remotes if the parameter is omitted.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
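Other server-side code can use the same helpers that this endpoint wraps,
e.g. (sketch inside an async fn, remote name is a placeholder):

    // trigger collection for a single remote ...
    metric_collection::trigger_metric_collection_for_remote("some-remote".to_string()).await?;
    // ... or for all remotes at once
    metric_collection::trigger_metric_collection().await?;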
server/src/api/metric_collection.rs | 35 +++++++++++++++++++++++++++++
server/src/api/mod.rs | 2 ++
2 files changed, 37 insertions(+)
create mode 100644 server/src/api/metric_collection.rs
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
new file mode 100644
index 00000000..1a9e0e28
--- /dev/null
+++ b/server/src/api/metric_collection.rs
@@ -0,0 +1,35 @@
+use anyhow::Error;
+
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use proxmox_router::{Router, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([(
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+),]);
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ },
+ },
+)]
+/// Trigger metric collection for a provided remote or for all remotes if no remote is passed.
+pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ if let Some(remote) = remote {
+ crate::metric_collection::trigger_metric_collection_for_remote(remote).await?;
+ } else {
+ crate::metric_collection::trigger_metric_collection().await?;
+ }
+
+ Ok(())
+}
diff --git a/server/src/api/mod.rs b/server/src/api/mod.rs
index 6c4831b4..ff875fc8 100644
--- a/server/src/api/mod.rs
+++ b/server/src/api/mod.rs
@@ -9,6 +9,7 @@ use proxmox_sortable_macro::sortable;
pub mod access;
pub mod config;
+pub mod metric_collection;
pub mod nodes;
pub mod pbs;
pub mod pve;
@@ -21,6 +22,7 @@ mod rrd_common;
const SUBDIRS: SubdirMap = &sorted!([
("access", &access::ROUTER),
("config", &config::ROUTER),
+ ("metric-collection", &metric_collection::ROUTER),
("ping", &Router::new().get(&API_METHOD_PING)),
("pve", &pve::ROUTER),
("pbs", &pbs::ROUTER),
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 21/28] api: remotes: trigger immediate metric collection for newly added nodes
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (19 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 20/28] api: add endpoint to trigger metric collection Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 22/28] api: add api for querying metric collection RRD data Lukas Wagner
` (8 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This makes sure that metric graphs are available right after adding a
new remote.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
server/src/api/remotes.rs | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 8ff1ada0..27b91cfe 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -18,6 +18,7 @@ use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
+use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
@@ -171,10 +172,15 @@ pub async fn add_remote(mut entry: Remote, create_token: Option<String>) -> Resu
entry.token = token;
}
+ let name = entry.id.clone();
remotes.insert(entry.id.to_owned(), entry);
pdm_config::remotes::save_config(&remotes)?;
+ if let Err(e) = metric_collection::trigger_metric_collection_for_remote(name).await {
+ log::error!("could not trigger metric collection after adding remote: {e}");
+ }
+
Ok(())
}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 22/28] api: add api for querying metric collection RRD data
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (20 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 21/28] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 23/28] api: metric-collection: add status endpoint Lukas Wagner
` (7 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This commit adds two new API endpoints:
- remotes/{id}/metric-collection-rrddata
- metric-collection/rrddata
The first one returns graphable datapoints for the API response time
observed when fetching metrics from a single remote; the second one returns
datapoints for the total time needed to collect metrics from *all* remotes.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
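Roughly, the returned datapoints look like this (field names from the new
rrddata types; timestamps, values and the query parameter spelling are only
examples):

    GET /api2/json/metric-collection/rrddata?timeframe=hour&cf=max
      -> [ { "time": 1739538000, "total-time": 1532.4 },
           { "time": 1739538060 },                  <- gap, total-time omitted
           ... ]

    GET /api2/json/remotes/<remote>/metric-collection-rrddata?timeframe=hour&cf=max
      -> [ { "time": 1739538000, "response-time": 87.3 },
           ... ]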
lib/pdm-api-types/src/rrddata.rs | 26 +++++++++++++
server/src/api/metric_collection.rs | 60 ++++++++++++++++++++++++++---
server/src/api/remotes.rs | 53 +++++++++++++++++++++++++
3 files changed, 134 insertions(+), 5 deletions(-)
diff --git a/lib/pdm-api-types/src/rrddata.rs b/lib/pdm-api-types/src/rrddata.rs
index a973977c..7559243c 100644
--- a/lib/pdm-api-types/src/rrddata.rs
+++ b/lib/pdm-api-types/src/rrddata.rs
@@ -216,3 +216,29 @@ pub struct PbsDatastoreDataPoint {
#[serde(skip_serializing_if = "Option::is_none")]
pub disk_write: Option<f64>,
}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for statistics about the metric collection loop.
+pub struct FullCollectionDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// Total time in milliseconds needed for full metric collection run.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub total_time: Option<f64>,
+}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for metric collection from a single remote.
+pub struct RemoteCollectionDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// API response time in milliseconds when requesting the metrics from the remote.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub response_time: Option<f64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 1a9e0e28..22736cb0 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -1,17 +1,27 @@
use anyhow::Error;
-use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
use proxmox_router::{Router, SubdirMap};
+use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
+use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+
+use super::rrd_common::{self, DataPoint};
+
pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
#[sortable]
-const SUBDIRS: SubdirMap = &sorted!([(
- "trigger",
- &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
-),]);
+const SUBDIRS: SubdirMap = &sorted!([
+ (
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+ ),
+ (
+ "rrddata",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
+ ),
+]);
#[api(
input: {
@@ -33,3 +43,43 @@ pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Err
Ok(())
}
+
+impl DataPoint for FullCollectionDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["total-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "total-time" {
+ self.total_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read metric collection RRD data.
+fn get_metric_collection_rrd_data(
+ timeframe: RrdTimeframe,
+ cf: RrdMode,
+) -> Result<Vec<FullCollectionDatapoint>, Error> {
+ let base = "local/metric-collection";
+ rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
+}
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 27b91cfe..bbb24b26 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -9,6 +9,8 @@ use proxmox_access_control::CachedUserInfo;
use proxmox_router::{
http_bail, http_err, list_subdirs_api_method, Permission, Router, RpcEnvironment, SubdirMap,
};
+use proxmox_rrd_api_types::RrdMode;
+use proxmox_rrd_api_types::RrdTimeframe;
use proxmox_schema::api;
use proxmox_schema::Schema;
use proxmox_section_config::typed::SectionConfigData;
@@ -16,12 +18,15 @@ use proxmox_sortable_macro::sortable;
use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
+use pdm_api_types::rrddata::RemoteCollectionDatapoint;
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
+use super::rrd_common;
+use super::rrd_common::DataPoint;
pub const ROUTER: Router = Router::new()
.get(&API_METHOD_LIST_REMOTES)
@@ -38,6 +43,10 @@ const ITEM_ROUTER: Router = Router::new()
const SUBDIRS: SubdirMap = &sorted!([
("config", &Router::new().get(&API_METHOD_REMOTE_CONFIG)),
("version", &Router::new().get(&API_METHOD_VERSION)),
+ (
+ "metric-collection-rrddata",
+ &Router::new().get(&API_METHOD_GET_PER_REMOTE_METRIC_COLLECTION_RRD_DATA)
+ ),
]);
pub fn get_remote<'a>(
@@ -331,3 +340,47 @@ pub fn remote_config(id: String) -> Result<Remote, Error> {
remote.token = String::new(); // mask token in response
Ok(remote.clone())
}
+
+impl DataPoint for RemoteCollectionDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["response-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "response-time" {
+ self.response_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ id: {
+ schema: REMOTE_ID_SCHEMA,
+ },
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read metric collection RRD data.
+fn get_per_remote_metric_collection_rrd_data(
+ id: String,
+ timeframe: RrdTimeframe,
+ cf: RrdMode,
+) -> Result<Vec<RemoteCollectionDatapoint>, Error> {
+ let base = format!("local/metric-collection/remotes/{id}");
+ rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 23/28] api: metric-collection: add status endpoint
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (21 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 22/28] api: add api for querying metric collection RRD data Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 24/28] pdm-client: add metric collection API methods Lukas Wagner
` (6 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This adds a new API endpoint at /metric-collection/status, returning the
status of the last metric collection attempt for each remote.
For now, this contains the timestamp of the last successful collection
(last-collection) and any error that occurred (error).
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
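Example response shape (field names from MetricCollectionStatus; values are
made up):

    GET /api2/json/metric-collection/status
      -> [ { "remote": "pve-one", "last-collection": 1739538123 },
           { "remote": "pbs-two", "last-collection": 1739538125,
             "error": "connection timed out" } ]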
lib/pdm-api-types/src/metric_collection.rs | 15 +++++++++++++++
server/src/api/metric_collection.rs | 16 +++++++++++++++-
server/src/metric_collection/mod.rs | 22 ++++++++++++++++++++++
3 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
index 92487d6c..a1157897 100644
--- a/lib/pdm-api-types/src/metric_collection.rs
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -186,3 +186,18 @@ impl CollectionSettings {
.unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
}
}
+
+#[api]
+#[derive(Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Per-remote collection status.
+pub struct MetricCollectionStatus {
+ /// The remote's name.
+ pub remote: String,
+ /// Any error that occurred during the last collection attempt.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub error: Option<String>,
+ /// Timestamp of last successful collection.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub last_collection: Option<i64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 22736cb0..a8243296 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -5,7 +5,11 @@ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
-use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+use pdm_api_types::{
+ remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint, MetricCollectionStatus,
+};
+
+use crate::metric_collection;
use super::rrd_common::{self, DataPoint};
@@ -21,6 +25,10 @@ const SUBDIRS: SubdirMap = &sorted!([
"rrddata",
&Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
),
+ (
+ "status",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
+ ),
]);
#[api(
@@ -83,3 +91,9 @@ fn get_metric_collection_rrd_data(
let base = "local/metric-collection";
rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
}
+
+#[api]
+/// Get the per-remote metric collection status.
+fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ metric_collection::get_status()
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 5b6c14d2..65ceb4b3 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,8 @@ use tokio::sync::mpsc::{self, Sender};
use proxmox_sys::fs::CreateOptions;
+use pdm_api_types::MetricCollectionStatus;
+
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
@@ -87,3 +89,23 @@ pub async fn trigger_metric_collection() -> Result<(), Error> {
Ok(())
}
+
+/// Get each remote's metric collection status.
+pub fn get_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ let (remotes, _) = pdm_config::remotes::config()?;
+ let state = collection_task::load_state()?;
+
+ let mut result = Vec::new();
+
+ for remote in remotes.order {
+ if let Some(status) = state.get_status(&remote) {
+ result.push(MetricCollectionStatus {
+ remote,
+ error: status.error.clone(),
+ last_collection: status.last_collection,
+ })
+ }
+ }
+
+ Ok(result)
+}
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 24/28] pdm-client: add metric collection API methods
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (22 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 23/28] api: metric-collection: add status endpoint Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 25/28] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
` (5 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This adds client bindings for the metric collection settings, trigger,
status, and RRD data APIs.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
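Usage sketch (assumes an already set-up `PdmClient` instance named `client`):

    client.trigger_metric_collection(Some("some-remote")).await?;

    for status in client.get_metric_collection_status().await? {
        println!(
            "{}: last collection: {:?}, error: {:?}",
            status.remote, status.last_collection, status.error
        );
    }

    let settings = client.get_metric_collection_settings().await?;
    println!("interval: {}s", settings.collection_interval_or_default());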
lib/pdm-client/src/lib.rs | 87 +++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/lib/pdm-client/src/lib.rs b/lib/pdm-client/src/lib.rs
index a41b82ca..70880576 100644
--- a/lib/pdm-client/src/lib.rs
+++ b/lib/pdm-client/src/lib.rs
@@ -320,6 +320,93 @@ impl<T: HttpApiClient> PdmClient<T> {
.ok_or_else(|| Error::BadApi("api returned no webauthn entry id".to_string(), None))
}
+ /// Get global metric collection settings.
+ pub async fn get_metric_collection_settings(
+ &self,
+ ) -> Result<pdm_api_types::CollectionSettings, Error> {
+ let path = "/api2/extjs/config/metric-collection/default";
+ Ok(self.0.get(path).await?.expect_json()?.data)
+ }
+
+ /// Update global metric collection settings.
+ pub async fn update_metric_collection_settings(
+ &self,
+ updater: pdm_api_types::CollectionSettingsUpdater,
+ delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
+ ) -> Result<(), Error> {
+ let path = "/api2/extjs/config/metric-collection/default";
+
+ #[derive(Serialize)]
+ struct UpdateSettings<'a> {
+ updater: &'a pdm_api_types::CollectionSettingsUpdater,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ delete: Option<Vec<pdm_api_types::DeletableCollectionSettingsProperty>>,
+ }
+
+ let params = UpdateSettings {
+ updater: &updater,
+ delete,
+ };
+ self.0.put(path, ¶ms).await?.nodata()?;
+ Ok(())
+ }
+
+ /// Trigger metric collection for a single remote or for all remotes, if no remote is provided.
+ pub async fn trigger_metric_collection(
+ &self,
+ remote: Option<&str>,
+ ) -> Result<(), proxmox_client::Error> {
+ let path = "/api2/extjs/metric-collection/trigger";
+
+ #[derive(Serialize)]
+ struct TriggerParams<'a> {
+ #[serde(skip_serializing_if = "Option::is_none")]
+ remote: Option<&'a str>,
+ }
+
+ self.0
+ .post(path, &TriggerParams { remote })
+ .await?
+ .nodata()?;
+
+ Ok(())
+ }
+
+ /// Get global metric collection status.
+ pub async fn get_metric_collection_status(
+ &self,
+ ) -> Result<Vec<pdm_api_types::MetricCollectionStatus>, Error> {
+ let path = "/api2/extjs/metric-collection/status";
+ Ok(self.0.get(path).await?.expect_json()?.data)
+ }
+
+ /// Get global metric collection RRD data.
+ pub async fn get_metric_collection_rrddata(
+ &self,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<pdm_api_types::rrddata::FullCollectionDatapoint, Error> {
+ let mut path = "/api2/extjs/metric-collection/rrddata".to_string();
+ let mut sep = '?';
+ add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
+ add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
+ /// Get per-remote metric collection RRD data.
+ pub async fn get_per_remote_metric_collection_rrddata(
+ &self,
+ remote: &str,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<pdm_api_types::rrddata::RemoteCollectionDatapoint, Error> {
+ let mut path = format!("/api2/extjs/remotes/{remote}/metric-collection-rrddata");
+ let mut sep = '?';
+ add_query_arg(&mut path, &mut sep, "cf", &Some(mode));
+ add_query_arg(&mut path, &mut sep, "timeframe", &Some(timeframe));
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
pub async fn pve_list_nodes(
&self,
remote: &str,
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 25/28] cli: add commands for metric-collection settings, trigger, status
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (23 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 24/28] pdm-client: add metric collection API methods Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 26/28] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
` (4 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This adds new commands to the proxmox-datacenter-client CLI tool, namely to
- show and update the metric collection settings
- trigger metric collection
- show the status of the last metric collection
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
Changes since v1:
- update-settings -> 'settings update'
- show-settings -> 'settings show'
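Resulting usage (the option names for 'settings update' are assumed to
follow the kebab-case updater fields):

    proxmox-datacenter-client metric-collection settings show
    proxmox-datacenter-client metric-collection settings update --collection-interval 600
    proxmox-datacenter-client metric-collection trigger [--remote <remote>]
    proxmox-datacenter-client metric-collection status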
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 170 ++++++++++++++++++++++++++++
3 files changed, 173 insertions(+)
create mode 100644 cli/client/src/metric_collection.rs
diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
index 96016cd3..5c96b284 100644
--- a/cli/client/Cargo.toml
+++ b/cli/client/Cargo.toml
@@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
proxmox-schema = { workspace = true, features = [ "api-macro" ] }
proxmox-section-config.workspace = true
proxmox-sys.workspace = true
+proxmox-time.workspace = true
proxmox-tfa = { workspace = true, features = [ "types" ] }
# for the pve API types:
diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
index c913e978..55877cf7 100644
--- a/cli/client/src/main.rs
+++ b/cli/client/src/main.rs
@@ -17,6 +17,7 @@ pub mod env;
pub mod acl;
pub mod config;
+pub mod metric_collection;
pub mod pbs;
pub mod pve;
pub mod remotes;
@@ -93,6 +94,7 @@ fn main_do() -> Result<(), Error> {
)
.insert("acl", acl::cli())
.insert("login", CliCommand::new(&API_METHOD_LOGIN))
+ .insert("metric-collection", metric_collection::cli())
.insert("pbs", pbs::cli())
.insert("pve", pve::cli())
.insert("remote", remotes::cli())
diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
new file mode 100644
index 00000000..1eac6415
--- /dev/null
+++ b/cli/client/src/metric_collection.rs
@@ -0,0 +1,170 @@
+use anyhow::Error;
+use pdm_api_types::{
+ remotes::REMOTE_ID_SCHEMA, CollectionSettingsUpdater, DeletableCollectionSettingsProperty,
+};
+use proxmox_router::cli::{
+ format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
+};
+use proxmox_schema::api;
+
+use crate::{client, env};
+
+pub fn cli() -> CommandLineInterface {
+ CliCommandMap::new()
+ .insert("settings", settings_cli())
+ .insert(
+ "trigger",
+ CliCommand::new(&API_METHOD_TRIGGER_METRIC_COLLECTION),
+ )
+ .insert(
+ "status",
+ CliCommand::new(&API_METHOD_METRIC_COLLECTION_STATUS),
+ )
+ .into()
+}
+
+fn settings_cli() -> CommandLineInterface {
+ CliCommandMap::new()
+ .insert(
+ "show",
+ CliCommand::new(&API_METHOD_SHOW_COLLECTION_SETTINGS),
+ )
+ .insert(
+ "update",
+ CliCommand::new(&API_METHOD_UPDATE_COLLECTION_SETTINGS),
+ )
+ .into()
+}
+
+#[api]
+/// Show metric collection settings.
+async fn show_collection_settings() -> Result<(), Error> {
+ let client = client()?;
+ let settings = client.get_metric_collection_settings().await?;
+
+ fn default_if_none<T>(value: Option<T>) -> &'static str {
+ if value.is_none() {
+ " (default)"
+ } else {
+ ""
+ }
+ }
+
+ let output_format = env().format_args.output_format;
+ if output_format == OutputFormat::Text {
+ println!("Metric collection settings:");
+ println!(
+ " collection-interval: {}s{}",
+ settings.collection_interval_or_default(),
+ default_if_none(settings.collection_interval),
+ );
+ println!(
+ " max-interval-offset: {}s{}",
+ settings.max_interval_offset_or_default(),
+ default_if_none(settings.max_interval_offset)
+ );
+ println!(
+ " min-interval-offset: {}s{}",
+ settings.min_interval_offset_or_default(),
+ default_if_none(settings.min_interval_offset)
+ );
+ println!(
+ " max-connection-delay: {}ms{}",
+ settings.max_connection_delay_or_default(),
+ default_if_none(settings.max_connection_delay)
+ );
+ println!(
+ " min-connection-delay: {}ms{}",
+ settings.min_connection_delay_or_default(),
+ default_if_none(settings.min_connection_delay)
+ );
+ println!(
+ " max-concurrent-connections: {}{}",
+ settings.max_concurrent_connections_or_default(),
+ default_if_none(settings.max_concurrent_connections),
+ );
+ } else {
+ format_and_print_result(&settings, &output_format.to_string());
+ }
+ Ok(())
+}
+
+#[api(
+ input: {
+ properties: {
+ updater: {
+ type: CollectionSettingsUpdater,
+ flatten: true,
+ },
+ delete: {
+ description: "Reset previously set collection settings properties.",
+ optional: true,
+ type: Array,
+ items: {
+ type: DeletableCollectionSettingsProperty,
+ },
+ },
+ }
+ }
+)]
+/// Change metric collection settings.
+async fn update_collection_settings(
+ updater: CollectionSettingsUpdater,
+ delete: Option<Vec<DeletableCollectionSettingsProperty>>,
+) -> Result<(), Error> {
+ let client = client()?;
+ client
+ .update_metric_collection_settings(updater, delete)
+ .await?;
+
+ Ok(())
+}
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ }
+ }
+)]
+/// Trigger metric collection. If a remote is passed, metrics are collected only from that
+/// remote; otherwise they are collected from all remotes.
+async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ client()?
+ .trigger_metric_collection(remote.as_deref())
+ .await?;
+ Ok(())
+}
+
+#[api]
+/// Show metric collection status.
+async fn metric_collection_status() -> Result<(), Error> {
+ let result = client()?.get_metric_collection_status().await?;
+
+ let output_format = env().format_args.output_format;
+ if output_format == OutputFormat::Text {
+ for remote_status in result {
+ let timestamp = if let Some(last_collection) = remote_status.last_collection {
+ proxmox_time::strftime_local("%a, %d %b %Y %T %z", last_collection)?
+ } else {
+ "never".into()
+ };
+
+ let status = if let Some(err) = &remote_status.error {
+ err
+ } else {
+ "ok"
+ };
+
+ println!("{}: {status}", remote_status.remote);
+ println!(" last successful: {timestamp}");
+ println!();
+ }
+ } else {
+ format_and_print_result(&result, &output_format.to_string());
+ }
+ Ok(())
+}
--
2.39.5
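For reference (not part of the patch), with no properties configured the text output of
`metric-collection settings show` produced by the code above would look roughly like this,
based on the println! calls and the compiled-in defaults from the CollectionSettings type
(exact values and formatting may differ):

 Metric collection settings:
  collection-interval: 600s (default)
  max-interval-offset: 10s (default)
  min-interval-offset: 0s (default)
  max-connection-delay: 100ms (default)
  min-connection-delay: 0ms (default)
  max-concurrent-connections: 10 (default)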
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 26/28] metric collection: factor out handle_tick and handle_control_message fns
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (24 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 25/28] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 27/28] metric collection: skip missed timer ticks Lukas Wagner
` (3 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
No functional changes, just moving some code into separate methods for
better readability.
Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
New in v2.
.../src/metric_collection/collection_task.rs | 99 ++++++++++---------
1 file changed, 54 insertions(+), 45 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 2ea735c5..954df0d6 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -89,53 +89,12 @@ impl MetricCollectionTask {
_ = timer.tick() => {
// Reload settings in case they have changed in the meanwhile
self.settings = Self::get_settings_or_default();
-
- log::debug!("starting metric collection from all remotes - triggered by timer");
- Self::sleep_for_random_millis(
- self.settings.min_interval_offset_or_default() * 1000,
- self.settings.max_interval_offset_or_default() * 1000,
- "interval-offset",
- ).await;
-
- if let Some(remotes) = Self::load_remote_config() {
- self.cleanup_removed_remotes_from_state(&remotes);
-
- let now = Instant::now();
- let to_fetch = remotes.order.as_slice();
- self.fetch_remotes(&remotes, to_fetch).await;
- let elapsed = now.elapsed();
-
- if let Err(err) = self.metric_data_tx.send(
- RrdStoreRequest::CollectionStats {
- timestamp: proxmox_time::epoch_i64(),
- stats: CollectionStats {
- // TODO: use as_millis_f64 once stabilized
- total_time: elapsed.as_secs_f64() * 1000.
- }
- }).await {
- log::error!("could not send collection stats to rrd task: {err}");
- }
- }
+ self.handle_tick().await;
}
-
- val = self.control_message_rx.recv() => {
- // Reload settings in case they have changed in the meanwhile
+ Some(message) = self.control_message_rx.recv() => {
+ // Reload settings in case they have changed in the meanwhile.
self.settings = Self::get_settings_or_default();
- match val {
- Some(ControlMsg::CollectSingleRemote(remote)) => {
- if let Some(remotes) = Self::load_remote_config() {
- log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
- self.fetch_remotes(&remotes, &[remote]).await;
- }
- }
- Some(ControlMsg::CollectAllRemotes) => {
- if let Some(remotes) = Self::load_remote_config() {
- log::debug!("starting metric collection from all remotes - triggered by control message");
- self.fetch_remotes(&remotes, &remotes.order).await;
- }
- }
- _ => {},
- }
+ self.handle_control_message(message).await;
}
}
@@ -158,6 +117,56 @@ impl MetricCollectionTask {
}
}
+ /// Handle a timer tick.
+ async fn handle_tick(&mut self) {
+ log::debug!("starting metric collection from all remotes - triggered by timer");
+ Self::sleep_for_random_millis(
+ self.settings.min_interval_offset_or_default() * 1000,
+ self.settings.max_interval_offset_or_default() * 1000,
+ "interval-offset",
+ )
+ .await;
+
+ if let Some(remotes) = Self::load_remote_config() {
+ self.cleanup_removed_remotes_from_state(&remotes);
+
+ let now = Instant::now();
+ let to_fetch = remotes.order.as_slice();
+ self.fetch_remotes(&remotes, to_fetch).await;
+ let elapsed = now.elapsed();
+
+ if let Err(err) = self
+ .metric_data_tx
+ .send(RrdStoreRequest::CollectionStats {
+ timestamp: proxmox_time::epoch_i64(),
+ stats: CollectionStats {
+ // TODO: use as_millis_f64 once stabilized
+ total_time: elapsed.as_secs_f64() * 1000.,
+ },
+ })
+ .await
+ {
+ log::error!("could not send collection stats to rrd task: {err}");
+ }
+ }
+ }
+
+ /// Handle a control message for force-triggered collection.
+ async fn handle_control_message(&mut self, message: ControlMsg) {
+ if let Some(remotes) = Self::load_remote_config() {
+ match message {
+ ControlMsg::CollectSingleRemote(remote) => {
+ log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
+ self.fetch_remotes(&remotes, &[remote]).await;
+ }
+ ControlMsg::CollectAllRemotes => {
+ log::debug!("starting metric collection from all remotes - triggered by control message");
+ self.fetch_remotes(&remotes, &remotes.order).await;
+ }
+ }
+ }
+ }
+
/// Sleep between `min` and `max` milliseconds.
///
/// If `min` is larger than `max`, `min` will be set to `max` and a log message
--
2.39.5
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 27/28] metric collection: skip missed timer ticks
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (25 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 26/28] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 28/28] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
` (2 subsequent siblings)
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
The default behavior of `tokio::time::Interval` when a tick is missed
is 'burst', which shortens the timer intervals until the original
alignment is restored, while keeping the *number of ticks* the same
as expected.
Example from the official tokio docs [1] for burst mode:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work|work|work-|work-----|
For metric collection, we do not really gain anything from bursting
missed ticks. For us, 'skip' is fine. There, the alignment is
immediately restored by skipping any ticks that have been missed:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work---| work----| work----|
[1] https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
New in v2.
server/src/metric_collection/collection_task.rs | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 954df0d6..5c6e2762 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -10,7 +10,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
- time::Interval,
+ time::{Interval, MissedTickBehavior},
};
use proxmox_section_config::typed::SectionConfigData;
@@ -230,6 +230,13 @@ impl MetricCollectionTask {
fn setup_timer(interval: u64) -> (Interval, Instant) {
log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
+
+ // If we miss a tick because a previous collection run took too long, we want to
+ // tick as soon as possible, but we do not need to repeat missing ticks.
+ // We do want to stay aligned, though.
+ // https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html#variant.Skip
+ timer.set_missed_tick_behavior(MissedTickBehavior::Skip);
+
let first_run = task_utils::next_aligned_instant(interval);
timer.reset_at(first_run.into());
--
2.39.5
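As a standalone illustration (not part of the patch; assumes tokio with the 'full' feature
set), a minimal sketch of an interval timer with the 'skip' behavior described above:

use tokio::time::{self, Duration, MissedTickBehavior};

#[tokio::main]
async fn main() {
    let mut timer = time::interval(Duration::from_secs(1));
    // With `Skip`, an iteration that overruns the interval does not cause a burst
    // of catch-up ticks afterwards; the next tick is simply re-aligned.
    timer.set_missed_tick_behavior(MissedTickBehavior::Skip);

    for i in 0..3 {
        timer.tick().await;
        println!("tick {i}");
        // Simulate a collection run that takes longer than the interval.
        time::sleep(Duration::from_millis(2500)).await;
    }
}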
* [pdm-devel] [PATCH proxmox-datacenter-manager v2 28/28] metric collection: use JoinSet instead of joining from handles in a Vec
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (26 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 27/28] metric collection: skip missed timer ticks Lukas Wagner
@ 2025-02-14 13:06 ` Lukas Wagner
2025-02-21 13:19 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Maximiliano Sandoval
2025-03-14 14:10 ` Lukas Wagner
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-14 13:06 UTC (permalink / raw)
To: pdm-devel
This lets us process finished tasks in the order they finish, not in the
order they were spawned.
Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
New in v2.
.../src/metric_collection/collection_task.rs | 25 ++++++++-----------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 5c6e2762..4e8aa627 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -10,6 +10,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
+ task::JoinSet,
time::{Interval, MissedTickBehavior},
};
@@ -267,7 +268,8 @@ impl MetricCollectionTask {
let semaphore = Arc::new(Semaphore::new(
self.settings.max_concurrent_connections_or_default(),
));
- let mut handles = Vec::new();
+
+ let mut handles = JoinSet::new();
let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
@@ -290,30 +292,23 @@ impl MetricCollectionTask {
if let Some(remote) = remote_config.get(remote_name).cloned() {
log::debug!("fetching remote '{}'", remote.id);
- let handle = tokio::spawn(Self::fetch_single_remote(
+ handles.spawn(Self::fetch_single_remote(
self.settings.clone(),
remote,
status,
self.metric_data_tx.clone(),
permit,
));
-
- handles.push((remote_name.clone(), handle));
}
}
- for (remote_name, handle) in handles {
- let res = handle.await;
-
+ while let Some(res) = handles.join_next().await {
match res {
- Ok(Ok(ts)) => {
- self.state.set_status(remote_name, ts);
+ Ok((name, status)) => {
+ self.state.set_status(name, status);
}
- Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
- log::error!(
- "join error for metric collection task for remote {remote_name}: {err}"
- )
+ log::error!("join error for metric collection task for remote: {err}")
}
}
}
@@ -362,7 +357,7 @@ impl MetricCollectionTask {
mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<RemoteStatus, Error> {
+ ) -> (String, RemoteStatus) {
Self::sleep_for_random_millis(
settings.min_connection_delay_or_default(),
settings.max_connection_delay_or_default(),
@@ -436,7 +431,7 @@ impl MetricCollectionTask {
}
}
- Ok(status)
+ (remote.id, status)
}
}
--
2.39.5
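For readers unfamiliar with `JoinSet`, a standalone sketch (not part of the patch; assumes
tokio with the 'full' feature set) of the completion-order property used above:

use std::time::Duration;

use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let mut set = JoinSet::new();
    for (name, delay_ms) in [("a", 300u64), ("b", 100), ("c", 200)] {
        set.spawn(async move {
            // Simulated per-remote fetch with differing response times.
            tokio::time::sleep(Duration::from_millis(delay_ms)).await;
            (name, "ok")
        });
    }
    // Results arrive in completion order ("b", "c", "a"), not spawn order.
    while let Some(res) = set.join_next().await {
        match res {
            Ok((name, status)) => println!("{name}: {status}"),
            Err(err) => eprintln!("join error: {err}"),
        }
    }
}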
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type Lukas Wagner
@ 2025-02-18 15:26 ` Wolfgang Bumiller
2025-02-18 15:31 ` Stefan Hanreich
2025-02-21 8:27 ` Lukas Wagner
0 siblings, 2 replies; 34+ messages in thread
From: Wolfgang Bumiller @ 2025-02-18 15:26 UTC (permalink / raw)
To: Lukas Wagner; +Cc: pdm-devel
On Fri, Feb 14, 2025 at 02:06:28PM +0100, Lukas Wagner wrote:
> This commit adds the CollectionSettings type which holds settings for
> the metric collection system. Included are collection interval, max
> concurrency and upper/lower bounds for the metric collection loop.
>
> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
> ---
> lib/pdm-api-types/src/lib.rs | 3 +
> lib/pdm-api-types/src/metric_collection.rs | 188 +++++++++++++++++++++
> 2 files changed, 191 insertions(+)
> create mode 100644 lib/pdm-api-types/src/metric_collection.rs
>
> diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
> index 38449071..6115e41c 100644
> --- a/lib/pdm-api-types/src/lib.rs
> +++ b/lib/pdm-api-types/src/lib.rs
> @@ -19,6 +19,9 @@ pub use acl::*;
> mod node_config;
> pub use node_config::*;
>
> +mod metric_collection;
> +pub use metric_collection::*;
> +
> mod proxy;
> pub use proxy::HTTP_PROXY_SCHEMA;
>
> diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
> new file mode 100644
> index 00000000..92487d6c
> --- /dev/null
> +++ b/lib/pdm-api-types/src/metric_collection.rs
> @@ -0,0 +1,188 @@
> +//! API types for metric collection settings.
> +
> +use serde::{Deserialize, Serialize};
> +
> +use proxmox_schema::{api, Updater};
> +
> +/// Default metric collection interval.
> +pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
> +
> +/// Minimum metric collection interval.
> +pub const MIN_COLLECTION_INTERVAL: u64 = 10;
> +
> +/// Maximum metric collection interval.
> +/// PVE and PBS keep 30 minutes of metric history,
> +/// maximum is set to 25 minutes to leave some headroom.
> +pub const MAX_COLLECTION_INTERVAL: u64 = 1500;
> +
> +/// Default number of concurrent connections.
> +pub const DEFAULT_CONCURRENCY: usize = 10;
> +/// Maximum number of concurrent connections.
> +pub const MAX_CONCURRENCY: u64 = 50;
> +/// Minimum number of concurrent connections (no concurrency).
> +pub const MIN_CONCURRENCY: u64 = 1;
> +
> +/// Default upper bound for the random delay for the
> +/// main metric collection loop.
> +pub const DEFAULT_MAX_OFFSET: u64 = 10;
> +/// Default lower bound for the random delay for the
> +/// main metric collection loop.
> +pub const DEFAULT_MIN_OFFSET: u64 = 0;
> +
> +/// Highest configurable upper bound for the random interval offset for the main loop.
> +pub const MAX_INTERVAL_OFFSET: u64 = 300;
> +/// Lowest configureable lower bound for the random interval offset for the main loop.
> +pub const MIN_INTERVAL_OFFSET: u64 = 0;
> +
> +/// Default upper bound for the random individual connection delay.
> +pub const DEFAULT_MAX_CONNECTION_DELAY: u64 = 100;
> +/// Default lower bound for the random individual connection delay.
> +pub const DEFAULT_MIN_CONNECTION_DELAY: u64 = 0;
> +
> +/// Highest configurable upper bound for the random individual connection delay.
> +pub const MAX_CONNECTION_DELAY: u64 = 1000;
> +/// Lowest configurable lower bound for the random individual connection delay.
> +pub const MIN_CONNECTION_DELAY: u64 = 0;
> +
> +#[api(
> + properties: {
> + "collection-interval" : {
> + optional: true,
> + default: DEFAULT_COLLECTION_INTERVAL as isize,
> + minimum: MIN_COLLECTION_INTERVAL as isize,
> + maximum: MAX_COLLECTION_INTERVAL as isize,
I thought about this the first time around but then forgot again:
Given that it is possible we might change these types (see Stefan's
patch for proxmox-schema) and because it is much more convenient
anyway you could use `as _` for these casts.
While I'm usually not a fan of `_` casts - I think this might be okay
for schema declarations.
(No need for a v3 if there's nothing else...)
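To illustrate the `as _` suggestion with a standalone sketch (not the actual macro code):
the cast target is inferred from context, so the constants would not need touching if the
schema's integer type ever changes.

const DEFAULT_COLLECTION_INTERVAL: u64 = 600;

fn main() {
    // The target type of `as _` is inferred from the annotated binding; in the
    // schema declaration it would come from whatever type the macro expects.
    let default: isize = DEFAULT_COLLECTION_INTERVAL as _;
    println!("{default}");
}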
> + },
> + "max-concurrent-connections" : {
> + optional: true,
> + default: DEFAULT_CONCURRENCY as isize,
> + minimum: MIN_CONCURRENCY as isize,
> + maximum: MAX_CONCURRENCY as isize,
> + },
> + "max-interval-offset" : {
> + optional: true,
> + default: DEFAULT_MAX_OFFSET as isize,
> + minimum: MIN_INTERVAL_OFFSET as isize,
> + maximum: MAX_INTERVAL_OFFSET as isize,
> + },
> + "min-interval-offset" : {
> + optional: true,
> + default: DEFAULT_MIN_OFFSET as isize,
> + minimum: MIN_INTERVAL_OFFSET as isize,
> + maximum: MAX_INTERVAL_OFFSET as isize,
> + },
> + "max-connection-delay" : {
> + optional: true,
> + default: DEFAULT_MAX_CONNECTION_DELAY as isize,
> + minimum: MIN_CONNECTION_DELAY as isize,
> + maximum: MAX_CONNECTION_DELAY as isize,
> + },
> + "min-connection-delay" : {
> + optional: true,
> + default: DEFAULT_MIN_CONNECTION_DELAY as isize,
> + minimum: MIN_CONNECTION_DELAY as isize,
> + maximum: MAX_CONNECTION_DELAY as isize,
> + },
> + },
> +)]
> +#[derive(Clone, Default, Deserialize, Serialize, Updater)]
> +#[serde(rename_all = "kebab-case")]
> +/// Settings for the metric collection system.
> +pub struct CollectionSettings {
> + /// Collection settings ID
> + #[updater(skip)]
> + pub id: String,
> +
> + /// Interval in seconds at which to collect metrics.
> + /// The point in time at which metrics are collected
> + /// are aligned based on the collection interval. For instance,
> + /// a collection interval of 300 (5 minutes) would schedule metric collection
> + /// at 11:00:00, 11:05:00 (without accounting for the random offset).
> + ///
> + /// To avoid load spikes, metric collection runs are offeset by
> + /// a random number of seconds between min_interval_offset and max_interval_offset.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub collection_interval: Option<u64>,
> +
> + /// Maximum number of concurrent connections while collecting metrics.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub max_concurrent_connections: Option<usize>,
> +
> + /// Maximum offset in seconds for the metric collection interval.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub max_interval_offset: Option<u64>,
> +
> + /// Minimum offset in seconds for the metric collection interval.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub min_interval_offset: Option<u64>,
> +
> + /// Maximum random delay in milliseconds for each outgoing connection.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub max_connection_delay: Option<u64>,
> +
> + /// Minimum random delay in milliseconds for each outgoing connection.
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub min_connection_delay: Option<u64>,
> +}
> +
> +#[api]
> +#[derive(Copy, Clone, Deserialize, Serialize)]
> +#[serde(rename_all = "kebab-case")]
> +/// Deletable property for [`CollectionSettings`].
> +pub enum DeletableCollectionSettingsProperty {
> + /// Delete 'collection-interval'.
> + CollectionInterval,
> + /// Delete 'max-concurrent-connections'.
> + MaxConcurrentConnections,
> + /// Delete 'max-interval-offset'.
> + MaxIntervalOffset,
> + /// Delete 'min-interval-offset'.
> + MinIntervalOffset,
> + /// Delete 'min-connection-delay'.
> + MaxConnectionDelay,
> + /// Delete 'min-connection-delay'.
> + MinConnectionDelay,
> +}
> +
> +impl CollectionSettings {
> + /// Create a new settings instance with a given `id`.
> + pub fn new(id: &str) -> Self {
> + Self {
> + id: id.into(),
> + ..Default::default()
> + }
> + }
> +
> + /// Return the collection interval or the default if not configured.
> + pub fn collection_interval_or_default(&self) -> u64 {
> + self.collection_interval
> + .unwrap_or(DEFAULT_COLLECTION_INTERVAL)
> + }
> +
> + /// Return the number of allowed concurrent connections or the default if not configured.
> + pub fn max_concurrent_connections_or_default(&self) -> usize {
> + self.max_concurrent_connections
> + .unwrap_or(DEFAULT_CONCURRENCY)
> + }
> +
> + /// Return the upper bound for the main loop delay or the default if not configured.
> + pub fn max_interval_offset_or_default(&self) -> u64 {
> + self.max_interval_offset.unwrap_or(DEFAULT_MAX_OFFSET)
> + }
> +
> + /// Return the lower bound for the main loop delay or the default if not configured.
> + pub fn min_interval_offset_or_default(&self) -> u64 {
> + self.min_interval_offset.unwrap_or(DEFAULT_MIN_OFFSET)
> + }
> +
> + /// Return the upper bound for the connection delay or the default if not configured.
> + pub fn max_connection_delay_or_default(&self) -> u64 {
> + self.max_connection_delay
> + .unwrap_or(DEFAULT_MAX_CONNECTION_DELAY)
> + }
> +
> + /// Return the lower bound for the connection delay or the default if not configured.
> + pub fn min_connection_delay_or_default(&self) -> u64 {
> + self.min_connection_delay
> + .unwrap_or(DEFAULT_MIN_CONNECTION_DELAY)
> + }
> +}
> --
> 2.39.5
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type
2025-02-18 15:26 ` Wolfgang Bumiller
@ 2025-02-18 15:31 ` Stefan Hanreich
2025-02-21 8:27 ` Lukas Wagner
1 sibling, 0 replies; 34+ messages in thread
From: Stefan Hanreich @ 2025-02-18 15:31 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion,
Wolfgang Bumiller, Lukas Wagner
On 2/18/25 16:26, Wolfgang Bumiller wrote:
> On Fri, Feb 14, 2025 at 02:06:28PM +0100, Lukas Wagner wrote:
>> This commit adds the CollectionSettings type which holds settings for
>> the metric collection system. Included are collection interval, max
>> concurrency and upper/lower bounds for the metric collection loop.
>>
>> Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
>> ---
>> lib/pdm-api-types/src/lib.rs | 3 +
>> lib/pdm-api-types/src/metric_collection.rs | 188 +++++++++++++++++++++
>> 2 files changed, 191 insertions(+)
>> create mode 100644 lib/pdm-api-types/src/metric_collection.rs
>>
>> diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
>> index 38449071..6115e41c 100644
>> --- a/lib/pdm-api-types/src/lib.rs
>> +++ b/lib/pdm-api-types/src/lib.rs
>> @@ -19,6 +19,9 @@ pub use acl::*;
>> mod node_config;
>> pub use node_config::*;
>>
>> +mod metric_collection;
>> +pub use metric_collection::*;
>> +
>> mod proxy;
>> pub use proxy::HTTP_PROXY_SCHEMA;
>>
>> diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
>> new file mode 100644
>> index 00000000..92487d6c
>> --- /dev/null
>> +++ b/lib/pdm-api-types/src/metric_collection.rs
>> @@ -0,0 +1,188 @@
>> +//! API types for metric collection settings.
>> +
>> +use serde::{Deserialize, Serialize};
>> +
>> +use proxmox_schema::{api, Updater};
>> +
>> +/// Default metric collection interval.
>> +pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
>> +
>> +/// Minimum metric collection interval.
>> +pub const MIN_COLLECTION_INTERVAL: u64 = 10;
>> +
>> +/// Maximum metric collection interval.
>> +/// PVE and PBS keep 30 minutes of metric history,
>> +/// maximum is set to 25 minutes to leave some headroom.
>> +pub const MAX_COLLECTION_INTERVAL: u64 = 1500;
>> +
>> +/// Default number of concurrent connections.
>> +pub const DEFAULT_CONCURRENCY: usize = 10;
>> +/// Maximum number of concurrent connections.
>> +pub const MAX_CONCURRENCY: u64 = 50;
>> +/// Minimum number of concurrent connections (no concurrency).
>> +pub const MIN_CONCURRENCY: u64 = 1;
>> +
>> +/// Default upper bound for the random delay for the
>> +/// main metric collection loop.
>> +pub const DEFAULT_MAX_OFFSET: u64 = 10;
>> +/// Default lower bound for the random delay for the
>> +/// main metric collection loop.
>> +pub const DEFAULT_MIN_OFFSET: u64 = 0;
>> +
>> +/// Highest configurable upper bound for the random interval offset for the main loop.
>> +pub const MAX_INTERVAL_OFFSET: u64 = 300;
>> +/// Lowest configureable lower bound for the random interval offset for the main loop.
>> +pub const MIN_INTERVAL_OFFSET: u64 = 0;
>> +
>> +/// Default upper bound for the random individual connection delay.
>> +pub const DEFAULT_MAX_CONNECTION_DELAY: u64 = 100;
>> +/// Default lower bound for the random individual connection delay.
>> +pub const DEFAULT_MIN_CONNECTION_DELAY: u64 = 0;
>> +
>> +/// Highest configurable upper bound for the random individual connection delay.
>> +pub const MAX_CONNECTION_DELAY: u64 = 1000;
>> +/// Lowest configurable lower bound for the random individual connection delay.
>> +pub const MIN_CONNECTION_DELAY: u64 = 0;
>> +
>> +#[api(
>> + properties: {
>> + "collection-interval" : {
>> + optional: true,
>> + default: DEFAULT_COLLECTION_INTERVAL as isize,
>> + minimum: MIN_COLLECTION_INTERVAL as isize,
>> + maximum: MAX_COLLECTION_INTERVAL as isize,
>
> I thought about this the first time around but then forgot again:
>
> Given that it is possible we might change these types (see Stefan's
> patch for proxmox-schema) and because it is much more convenient
> anyway you could use `as _` for these casts.
fyi: I put it on the backburner for now, because the methods in SDN all
got a new parameter (because of locking). This removed the necessity of
the patch from my pov - but I can pick it up again today/tomorrow if we
want it somewhere else.
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type
2025-02-18 15:26 ` Wolfgang Bumiller
2025-02-18 15:31 ` Stefan Hanreich
@ 2025-02-21 8:27 ` Lukas Wagner
1 sibling, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-02-21 8:27 UTC (permalink / raw)
To: Wolfgang Bumiller; +Cc: pdm-devel
On 2025-02-18 16:26, Wolfgang Bumiller wrote:
>> +
>> +#[api(
>> + properties: {
>> + "collection-interval" : {
>> + optional: true,
>> + default: DEFAULT_COLLECTION_INTERVAL as isize,
>> + minimum: MIN_COLLECTION_INTERVAL as isize,
>> + maximum: MAX_COLLECTION_INTERVAL as isize,
>
> I thought about this the first time around but then forgot again:
>
> Given that it is possible we might change these types (see Stefan's
> patch for proxmox-schema) and because it is much more convenient
> anyway you could use `as _` for these casts.
>
> While I'm usually not a fan of `_` casts - I think this might be okay
> for schema declarations.
>
> (No need for a v3 if there's nothing else...)
>
Ack, I'll send a follow-up if nothing else comes up, otherwise I'll fix it in v3.
--
- Lukas
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI)
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (27 preceding siblings ...)
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 28/28] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
@ 2025-02-21 13:19 ` Maximiliano Sandoval
2025-03-14 14:10 ` Lukas Wagner
29 siblings, 0 replies; 34+ messages in thread
From: Maximiliano Sandoval @ 2025-02-21 13:19 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion
I went through the series and it looks good to me.
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Lukas Wagner <l.wagner@proxmox.com> writes:
> [...]
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI)
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
` (28 preceding siblings ...)
2025-02-21 13:19 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Maximiliano Sandoval
@ 2025-03-14 14:10 ` Lukas Wagner
29 siblings, 0 replies; 34+ messages in thread
From: Lukas Wagner @ 2025-03-14 14:10 UTC (permalink / raw)
To: pdm-devel
On 2025-02-14 14:06, Lukas Wagner wrote:
> ## To reviewers / open questions:
> - Please review the defaults I've chosen for the settings, especially
> the ones for the default metric collection interval (10 minutes) as
> well as max-concurrency (10).
> I also kindly ask to double-check the naming of the properties.
> See "pdm-api-types: add CollectionSettings type" for details
>
> - Please review path and params for new API endpoints (anything public
> facing that is hard to change later)
>
> - I've chosen a section-config config now, even though we only have a
> single section for now. This was done for future-proofing reasons,
> maybe we want to add support for different setting 'groups' or
> something, e.g. to have different settings for distinct sets of
> remotes. Does this make sense?
> Or should I just stick to a simple config for now? (At moments like
> these I wish for TOML configs where we could be a bit more flexible...)
>
> collection-settings: default
> max-concurrency 10
> collection-interval 180
> min-interval-offset 0
> max-interval-offset 20
> min-connection-delay 10
> max-connection-delay 100
>
Currently thinking about generalizing the `max-concurrency` setting into something global
that affects all 'background' polling operations (resource cache/task cache refreshes/
metric fetching).
For actually controlling the concurrency we could maybe have a globally available
semaphore (with some deadlock potential in certain cases).
Alternatively, we could think about having a 'background request queue' and
a 'background request scheduler', which does the actual requests on behalf of
the other tasks.
I'll give this a bit more thought in the next days/weeks, so maybe don't merge
this in the meantime.
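As a rough sketch of the 'globally available semaphore' idea (standalone, with a
hypothetical limit of 10 shared by independent background tasks):

use std::sync::Arc;

use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Hypothetical global limit shared by all background polling tasks.
    let limit = Arc::new(Semaphore::new(10));

    let mut handles = Vec::new();
    for i in 0..20 {
        let limit = Arc::clone(&limit);
        handles.push(tokio::spawn(async move {
            // Hold a permit for the duration of one outgoing request.
            let _permit = limit.acquire_owned().await.expect("semaphore closed");
            println!("background request {i} running");
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}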
--
- Lukas
Thread overview: 34+ messages
2025-02-14 13:06 [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 01/28] test support: add NamedTempFile helper Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 02/28] test support: add NamedTempDir helper Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 03/28] pdm-api-types: add CollectionSettings type Lukas Wagner
2025-02-18 15:26 ` Wolfgang Bumiller
2025-02-18 15:31 ` Stefan Hanreich
2025-02-21 8:27 ` Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 04/28] pdm-config: add functions for reading/writing metric collection settings Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 05/28] metric collection: split top_entities split into separate module Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 06/28] metric collection: save metric data to RRD in separate task Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 07/28] metric collection: rework metric poll task Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 08/28] metric collection: persist state after metric collection Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 09/28] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 10/28] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 11/28] metric collection: add tests for the fetch_remotes function Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 12/28] metric collection: add test for fetch_overdue Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 13/28] metric collection: pass rrd cache instance as function parameter Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 14/28] metric collection: add test for rrd task Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 15/28] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 16/28] metric collection: record remote response time in metric database Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 17/28] metric collection: save time needed for collection run to RRD Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 18/28] metric collection: periodically clean removed remotes from statefile Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 19/28] api: add endpoint for updating metric collection settings Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 20/28] api: add endpoint to trigger metric collection Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 21/28] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 22/28] api: add api for querying metric collection RRD data Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 23/28] api: metric-collection: add status endpoint Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 24/28] pdm-client: add metric collection API methods Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 25/28] cli: add commands for metric-collection settings, trigger, status Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 26/28] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 27/28] metric collection: skip missed timer ticks Lukas Wagner
2025-02-14 13:06 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 28/28] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
2025-02-21 13:19 ` [pdm-devel] [PATCH proxmox-datacenter-manager v2 00/28] metric collection improvements (concurrency, config, API, CLI) Maximiliano Sandoval
2025-03-14 14:10 ` Lukas Wagner