* [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
@ 2025-08-21 9:52 Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 01/23] metric collection: split top_entities into separate module Lukas Wagner
` (25 more replies)
0 siblings, 26 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:52 UTC (permalink / raw)
To: pdm-devel
Key points:
- fetch metrics concurrently
- Add some tests for the core logic in the metric collection system
- Allow triggering metric collection via the API
- Record metric collection statistics in the RRD
- overall collection time for all remotes
- per-remote response time when fetching metrics
- Persist metric collection state to disk:
/var/lib/proxmox-datacenter-manager/metric-collection-state.json
(timestamps of last collection, errors)
- Trigger metric collection for any new remotes added via the API
- Add new API endpoints
POST /metric-collection/trigger with optional 'remote' param
GET /metric-collection/status
GET /remotes/<remote>/metric-collection-rrddata
GET /metric-collection/rrddata
- Add CLI tooling
proxmox-datacenter-client metric-collection trigger [--remote <remote>]
proxmox-datacenter-client metric-collection status
## To reviewers / open questions:
- Please review paths and params for the new API endpoints (anything
public-facing is hard to change later)
- Should `GET /remotes/<remote>/metric-collection-rrddata` be
just `rrddata`?
Not sure whether we are going to add any other PDM-native per-remote
metrics, and whether we would want to return them from the same API call
as this one...
## Potential future work
- UI button for triggering metric collection
- UI for metric collection settings
- Show RRD graphs for metric collection stats somewhere
- Have some global concurrency control knob for background
requests [request scheduling].
Changes since [v5]:
- Rebased onto latest master
Changes since [v4]:
- Drop metric collection config file for now -
these settings might be better stored together with the config for other
background tasks
Changes since [v3]:
- Rebase onto master
- Fix a couple clippy warnings (CreateOptions is now Copy!)
Changes since [v2]:
- For now, drop settings that might change anyway with a
global background request scheduling system [request scheduling]:
- max-concurrency
- {min,max}-interval-offset
- {min,max}-connection-delay
Changes since [v1]:
- add missing dependency on librust-rand-dev to d/control
- Fix a couple of minor spelling/punctuation issues (thx maximiliano)
- Some minor code style improvements, e.g. using unwrap_or_else instead
of doing a manual match
- Document return values of 'setup_timer' function
- Factor out handle_tick/handle_control_message
- Minor refactoring/code style improvements
- CLI: Change 'update-settings' to 'settings update'
- CLI: Change 'show-settings' to 'settings show'
- change missed tick behavior for tokio::time::Interval to 'skip'
instead of burst.
The last three commits are new in v2.
[v1]: https://lore.proxmox.com/pdm-devel/20250211120541.163621-1-l.wagner@proxmox.com/T/#t
[v2]: https://lore.proxmox.com/pdm-devel/20250214130653.283012-1-l.wagner@proxmox.com/
[v3]: https://lore.proxmox.com/pdm-devel/20250416125642.291552-1-l.wagner@proxmox.com/T/#t
[v4]: https://lore.proxmox.com/pdm-devel/20250512133725.262263-1-l.wagner@proxmox.com/T/#t
[v5]: https://lore.proxmox.com/pdm-devel/c1a8deae-9590-471c-8505-d3e799bc7125@proxmox.com/T/#t
[request scheduling]: https://lore.proxmox.com/pdm-devel/7b3e90c8-6ebb-400f-acf9-cac084cc39fe@proxmox.com/
proxmox-datacenter-manager:
Lukas Wagner (23):
metric collection: split top_entities into separate module
metric collection: save metric data to RRD in separate task
metric collection: rework metric poll task
metric collection: persist state after metric collection
metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
metric collection: collect overdue metrics on startup/timer change
metric collection: add tests for the fetch_remotes function
metric collection: add test for fetch_overdue
metric collection: pass rrd cache instance as function parameter
metric collection: add test for rrd task
metric collection: wrap rrd_cache::Cache in a struct
metric collection: record remote response time in metric database
metric collection: save time needed for collection run to RRD
metric collection: periodically clean removed remotes from statefile
api: add endpoint to trigger metric collection
api: remotes: trigger immediate metric collection for newly added
nodes
api: add api for querying metric collection RRD data
api: metric-collection: add status endpoint
pdm-client: add metric collection API methods
cli: add commands for metric-collection trigger and status
metric collection: factor out handle_tick and handle_control_message
fns
metric collection: skip missed timer ticks
metric collection: use JoinSet instead of joining from handles in a
Vec
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 70 ++
lib/pdm-api-types/src/lib.rs | 3 +
lib/pdm-api-types/src/metric_collection.rs | 20 +
lib/pdm-api-types/src/rrddata.rs | 26 +
lib/pdm-client/src/lib.rs | 58 ++
server/src/api/metric_collection.rs | 99 +++
server/src/api/mod.rs | 2 +
server/src/api/remotes.rs | 59 ++
server/src/api/resources.rs | 3 +-
server/src/api/rrd_common.rs | 11 +-
server/src/bin/proxmox-datacenter-api/main.rs | 2 +-
.../src/metric_collection/collection_task.rs | 656 ++++++++++++++++++
server/src/metric_collection/mod.rs | 346 +++------
server/src/metric_collection/rrd_cache.rs | 204 +++---
server/src/metric_collection/rrd_task.rs | 289 ++++++++
server/src/metric_collection/state.rs | 150 ++++
server/src/metric_collection/top_entities.rs | 150 ++++
19 files changed, 1783 insertions(+), 368 deletions(-)
create mode 100644 cli/client/src/metric_collection.rs
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
create mode 100644 server/src/api/metric_collection.rs
create mode 100644 server/src/metric_collection/collection_task.rs
create mode 100644 server/src/metric_collection/rrd_task.rs
create mode 100644 server/src/metric_collection/state.rs
create mode 100644 server/src/metric_collection/top_entities.rs
Summary over all repositories:
19 files changed, 1783 insertions(+), 368 deletions(-)
--
Generated by murpp 0.9.0
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 01/23] metric collection: split top_entities into separate module
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
@ 2025-08-21 9:52 ` Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 02/23] metric collection: save metric data to RRD in separate task Lukas Wagner
` (24 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:52 UTC (permalink / raw)
To: pdm-devel
This makes the parent module a bit easier to navigate.
No functional changes intended.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/api/resources.rs | 3 +-
server/src/metric_collection/mod.rs | 149 +------------------
server/src/metric_collection/top_entities.rs | 148 ++++++++++++++++++
3 files changed, 152 insertions(+), 148 deletions(-)
create mode 100644 server/src/metric_collection/top_entities.rs
diff --git a/server/src/api/resources.rs b/server/src/api/resources.rs
index 6a8c8ef2..13a959a6 100644
--- a/server/src/api/resources.rs
+++ b/server/src/api/resources.rs
@@ -26,6 +26,7 @@ use proxmox_subscription::SubscriptionStatus;
use pve_api_types::{ClusterResource, ClusterResourceType};
use crate::connection;
+use crate::metric_collection::top_entities;
pub const ROUTER: Router = Router::new()
.get(&list_subdirs_api_method!(SUBDIRS))
@@ -320,7 +321,7 @@ async fn get_top_entities(timeframe: Option<RrdTimeframe>) -> Result<TopEntities
let (remotes_config, _) = pdm_config::remotes::config()?;
let timeframe = timeframe.unwrap_or(RrdTimeframe::Day);
- let res = crate::metric_collection::calculate_top(&remotes_config, timeframe, 10);
+ let res = top_entities::calculate_top(&remotes_config, timeframe, 10);
Ok(res)
}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index f62e2a63..588f7ad5 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -6,15 +6,13 @@ use anyhow::Error;
use pbs_api_types::{MetricDataPoint, MetricDataType};
use proxmox_rrd::rrd::DataSourceType;
-use pdm_api_types::{
- remotes::RemoteType,
- resource::{Resource, ResourceRrdData, TopEntities, TopEntity},
-};
+use pdm_api_types::remotes::RemoteType;
use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+pub mod top_entities;
const COLLECTION_INTERVAL: u64 = 60;
@@ -150,146 +148,3 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
data_source_type,
);
}
-
-fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
- let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
- Ok(idx) | Err(idx) => idx,
- };
-
- vec.insert(index, value);
- if vec.len() > limit {
- for _ in 0..(vec.len() - limit) {
- vec.remove(0);
- }
- }
-}
-
-// for now simple sum of the values => area under the graph curve
-fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
- let mut coefficient = 0.0;
- for point in values.data.iter() {
- let value = point.unwrap_or_default();
- if value.is_finite() {
- coefficient += value;
- }
- }
-
- coefficient
-}
-
-// FIXME: cache the values instead of calculate freshly every time?
-// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
-pub fn calculate_top(
- remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- num: usize,
-) -> TopEntities {
- let mut guest_cpu = Vec::new();
- let mut node_cpu = Vec::new();
- let mut node_memory = Vec::new();
-
- for remote_name in remotes.keys() {
- if let Some(data) =
- crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
- {
- for res in data.resources {
- let id = res.id().to_string();
- let name = format!("pve/{remote_name}/{id}");
- match &res {
- Resource::PveStorage(_) => {}
- Resource::PveQemu(_) | Resource::PveLxc(_) => {
- if let Some(entity) =
- get_entity(timeframe, remote_name, res, name, "cpu_current")
- {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
- }
- }
- Resource::PveNode(_) => {
- if let Some(entity) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "cpu_current",
- ) {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
- }
- // convert mem/mem_total into a single entity
- if let Some(mut mem) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "mem_used",
- ) {
- if let Some(mem_total) =
- get_entity(timeframe, remote_name, res, name, "mem_total")
- {
- // skip if we don't have the same amount of data for used and total
- let mem_rrd = &mem.1.rrd_data.data;
- let mem_total_rrd = &mem_total.1.rrd_data.data;
- if mem_rrd.len() != mem_total_rrd.len() {
- continue;
- }
- let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
- let mut mem_usage = Vec::new();
- for i in 0..mem_rrd.len() {
- let point = match (mem_rrd[i], mem_total_rrd[i]) {
- (Some(mem), Some(total)) => Some(mem / total),
- _ => None,
- };
- mem_usage.push(point)
- }
- mem.1.rrd_data.data = mem_usage;
- insert_sorted(&mut node_memory, (coefficient, mem.1), num);
- }
- }
- }
- Resource::PbsNode(_) => {}
- Resource::PbsDatastore(_) => {}
- }
- }
- }
- }
-
- TopEntities {
- guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
- }
-}
-
-fn get_entity(
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- remote_name: &String,
- res: Resource,
- name: String,
- metric: &str,
-) -> Option<(f64, TopEntity)> {
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &name,
- metric,
- timeframe,
- proxmox_rrd_api_types::RrdMode::Average,
- ) {
- let coefficient = calculate_coefficient(&values);
- if coefficient > 0.0 {
- return Some((
- coefficient,
- TopEntity {
- remote: remote_name.to_string(),
- resource: res,
- rrd_data: ResourceRrdData {
- start: values.start,
- resolution: values.resolution,
- data: values.data,
- },
- },
- ));
- }
- }
-
- None
-}
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
new file mode 100644
index 00000000..f8e053fb
--- /dev/null
+++ b/server/src/metric_collection/top_entities.rs
@@ -0,0 +1,148 @@
+use std::collections::HashMap;
+
+use pdm_api_types::resource::{Resource, ResourceRrdData, TopEntities, TopEntity};
+
+use super::rrd_cache;
+
+fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
+ let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
+ Ok(idx) | Err(idx) => idx,
+ };
+
+ vec.insert(index, value);
+ if vec.len() > limit {
+ for _ in 0..(vec.len() - limit) {
+ vec.remove(0);
+ }
+ }
+}
+
+// for now simple sum of the values => area under the graph curve
+fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
+ let mut coefficient = 0.0;
+ for point in values.data.iter() {
+ let value = point.unwrap_or_default();
+ if value.is_finite() {
+ coefficient += value;
+ }
+ }
+
+ coefficient
+}
+
+// FIXME: cache the values instead of calculate freshly every time?
+// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
+pub fn calculate_top(
+ remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ num: usize,
+) -> TopEntities {
+ let mut guest_cpu = Vec::new();
+ let mut node_cpu = Vec::new();
+ let mut node_memory = Vec::new();
+
+ for remote_name in remotes.keys() {
+ if let Some(data) =
+ crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
+ {
+ for res in data.resources {
+ let id = res.id().to_string();
+ let name = format!("pve/{remote_name}/{id}");
+ match &res {
+ Resource::PveStorage(_) => {}
+ Resource::PveQemu(_) | Resource::PveLxc(_) => {
+ if let Some(entity) =
+ get_entity(timeframe, remote_name, res, name, "cpu_current")
+ {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
+ }
+ }
+ Resource::PveNode(_) => {
+ if let Some(entity) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "cpu_current",
+ ) {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
+ }
+ // convert mem/mem_total into a single entity
+ if let Some(mut mem) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "mem_used",
+ ) {
+ if let Some(mem_total) =
+ get_entity(timeframe, remote_name, res, name, "mem_total")
+ {
+ // skip if we don't have the same amount of data for used and total
+ let mem_rrd = &mem.1.rrd_data.data;
+ let mem_total_rrd = &mem_total.1.rrd_data.data;
+ if mem_rrd.len() != mem_total_rrd.len() {
+ continue;
+ }
+ let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
+ let mut mem_usage = Vec::new();
+ for i in 0..mem_rrd.len() {
+ let point = match (mem_rrd[i], mem_total_rrd[i]) {
+ (Some(mem), Some(total)) => Some(mem / total),
+ _ => None,
+ };
+ mem_usage.push(point)
+ }
+ mem.1.rrd_data.data = mem_usage;
+ insert_sorted(&mut node_memory, (coefficient, mem.1), num);
+ }
+ }
+ }
+ Resource::PbsNode(_) => {}
+ Resource::PbsDatastore(_) => {}
+ }
+ }
+ }
+ }
+
+ TopEntities {
+ guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
+ }
+}
+
+fn get_entity(
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ remote_name: &String,
+ res: Resource,
+ name: String,
+ metric: &str,
+) -> Option<(f64, TopEntity)> {
+ if let Ok(Some(values)) = rrd_cache::extract_data(
+ &name,
+ metric,
+ timeframe,
+ proxmox_rrd_api_types::RrdMode::Average,
+ ) {
+ let coefficient = calculate_coefficient(&values);
+ if coefficient > 0.0 {
+ return Some((
+ coefficient,
+ TopEntity {
+ remote: remote_name.to_string(),
+ resource: res,
+ rrd_data: ResourceRrdData {
+ start: values.start,
+ resolution: values.resolution,
+ data: values.data,
+ },
+ },
+ ));
+ }
+ }
+
+ None
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 02/23] metric collection: save metric data to RRD in separate task
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 01/23] metric collection: split top_entities into separate module Lukas Wagner
@ 2025-08-21 9:52 ` Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 03/23] metric collection: rework metric poll task Lukas Wagner
` (23 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:52 UTC (permalink / raw)
To: pdm-devel
This is a preparation for concurrent metric collection. While the RRD
cache appears to be safe for concurrent access (it uses an RwLock to
protect the data), delegating all RRD writes to a separate tokio task
makes it IMO easier to reason about and understand. Furthermore, it
decouples the 'metric fetching' and 'metric storing' parts, making it
much easier to write tests for both parts independently.
For better code separation, this also splits out the new task into
a new submodule.
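As a rough sketch (simplified and standalone; the type and function names
here are illustrative stand-ins for the actual RrdStoreRequest and
store_in_rrd_task added below), the pattern looks like this: producers push
metric batches into an mpsc channel, and a single consumer task performs
the blocking RRD writes via spawn_blocking:

    use tokio::sync::mpsc;

    // Simplified stand-in for RrdStoreRequest.
    enum StoreRequest {
        Data { remote: String, values: Vec<i64> },
    }

    // Single consumer: all RRD writes funnel through this task.
    async fn store_task(mut rx: mpsc::Receiver<StoreRequest>) {
        while let Some(StoreRequest::Data { remote, values }) = rx.recv().await {
            // Blocking file IO is moved off the async worker threads.
            let _ = tokio::task::spawn_blocking(move || {
                // ... write `values` for `remote` into the RRD here ...
                let _ = (remote, values);
            })
            .await;
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(128);
        let store_handle = tokio::spawn(store_task(rx));

        tx.send(StoreRequest::Data {
            remote: "pve-1".into(),
            values: vec![1, 2, 3],
        })
        .await
        .unwrap();

        drop(tx); // closing the channel lets the store task finish
        store_handle.await.unwrap();
    }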
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/metric_collection/mod.rs | 102 +++++++----------------
server/src/metric_collection/rrd_task.rs | 96 +++++++++++++++++++++
2 files changed, 128 insertions(+), 70 deletions(-)
create mode 100644 server/src/metric_collection/rrd_task.rs
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 588f7ad5..9c75f39c 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -2,18 +2,18 @@ use std::collections::HashMap;
use std::pin::pin;
use anyhow::Error;
-
-use pbs_api_types::{MetricDataPoint, MetricDataType};
-use proxmox_rrd::rrd::DataSourceType;
+use tokio::sync::mpsc::{self, Sender};
use pdm_api_types::remotes::RemoteType;
-use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+mod rrd_task;
pub mod top_entities;
+use rrd_task::RrdStoreRequest;
+
const COLLECTION_INTERVAL: u64 = 60;
/// Initialize the RRD cache
@@ -25,14 +25,22 @@ pub fn init() -> Result<(), Error> {
/// Start the metric collection task.
pub fn start_task() {
+ let (tx, rx) = mpsc::channel(128);
+
tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task());
+ let task_scheduler = pin!(metric_collection_task(tx));
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(task_scheduler, abort_future).await;
+ });
+
+ tokio::spawn(async move {
+ let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(task_scheduler, abort_future).await;
});
}
-async fn metric_collection_task() -> Result<(), Error> {
+async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
loop {
@@ -59,37 +67,33 @@ async fn metric_collection_task() -> Result<(), Error> {
.cluster_metrics_export(Some(true), Some(false), Some(start_time))
.await?;
- //// Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
RemoteType::Pbs => {
let client = connection::make_pbs_client(&remote)?;
let metrics = client.metrics(Some(true), Some(start_time)).await?;
- // Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
}?;
@@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
}
}
}
-
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
- let name = format!(
- "pve/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
- ClusterMetricsDataType::Counter => DataSourceType::Counter,
- ClusterMetricsDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
-
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
- let name = format!(
- "pbs/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- MetricDataType::Gauge => DataSourceType::Gauge,
- MetricDataType::Counter => DataSourceType::Counter,
- MetricDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
new file mode 100644
index 00000000..a72945df
--- /dev/null
+++ b/server/src/metric_collection/rrd_task.rs
@@ -0,0 +1,96 @@
+use anyhow::Error;
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use proxmox_rrd::rrd::DataSourceType;
+use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
+use tokio::sync::mpsc::Receiver;
+
+use super::rrd_cache;
+
+/// Store request for the RRD task.
+pub(super) enum RrdStoreRequest {
+ /// Store PVE metrics.
+ Pve {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: ClusterMetrics,
+ },
+ /// Store PBS metrics.
+ Pbs {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: Metrics,
+ },
+}
+
+/// Task which stores received metrics in the RRD. Metric data is fed into
+/// this task via a MPSC channel.
+pub(super) async fn store_in_rrd_task(
+ mut receiver: Receiver<RrdStoreRequest>,
+) -> Result<(), Error> {
+ while let Some(msg) = receiver.recv().await {
+ // Involves some blocking file IO
+ let res = tokio::task::spawn_blocking(move || match msg {
+ RrdStoreRequest::Pve { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pve(&remote, &data_point);
+ }
+ }
+ RrdStoreRequest::Pbs { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pbs(&remote, &data_point);
+ }
+ }
+ })
+ .await;
+
+ if let Err(err) = res {
+ log::error!("error in rrd task when attempting to save metrics: {err}");
+ }
+ }
+
+ Ok(())
+}
+
+fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+ let name = format!(
+ "pve/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
+ ClusterMetricsDataType::Counter => DataSourceType::Counter,
+ ClusterMetricsDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
+
+fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+ let name = format!(
+ "pbs/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ MetricDataType::Gauge => DataSourceType::Gauge,
+ MetricDataType::Counter => DataSourceType::Counter,
+ MetricDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 03/23] metric collection: rework metric poll task
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 01/23] metric collection: split top_entities into separate module Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 02/23] metric collection: save metric data to RRD in separate task Lukas Wagner
@ 2025-08-21 9:52 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 04/23] metric collection: persist state after metric collection Lukas Wagner
` (22 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:52 UTC (permalink / raw)
To: pdm-devel
Metric collection is mostly limited by network latency, not disk IO or
CPU. This means we should be able to get significant speedups by polling
remotes concurrently using spawned tokio workers.
To avoid load/IO spikes, we limit the number of concurrent connections
using a semaphore.
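The limiting works roughly like the following sketch (the constant matches
MAX_CONCURRENT_CONNECTIONS = 20 from this patch; the loop body is
illustrative):

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    const MAX_CONCURRENT_CONNECTIONS: usize = 20;

    async fn fetch_all(remotes: Vec<String>) {
        let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
        let mut handles = Vec::new();

        for remote in remotes {
            // acquire_owned() only fails if the semaphore has been closed.
            let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
            handles.push(tokio::spawn(async move {
                let _permit = permit; // held until this fetch task finishes
                // ... fetch metrics for `remote` here ...
                let _ = remote;
            }));
        }

        for handle in handles {
            let _ = handle.await;
        }
    }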
Furthermore, we add a mechanism to trigger metric collection manually.
This is achieved by using `tokio::select!` on both the timer's `tick`
method and the `recv` method of an `mpsc` channel which is used to send
control messages to the metric collection task.
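In simplified form (the ControlMsg variants mirror the enum added in this
patch; the interval and branch bodies are placeholders):

    use std::time::Duration;
    use tokio::sync::mpsc;

    enum ControlMsg {
        CollectSingleRemote(String),
        CollectAllRemotes,
    }

    async fn run(mut control_rx: mpsc::Receiver<ControlMsg>) {
        let mut timer = tokio::time::interval(Duration::from_secs(600));

        loop {
            tokio::select! {
                _ = timer.tick() => {
                    // regular, timer-driven collection of all remotes
                }
                msg = control_rx.recv() => match msg {
                    Some(ControlMsg::CollectSingleRemote(_remote)) => {
                        // collect a single remote as soon as possible
                    }
                    Some(ControlMsg::CollectAllRemotes) => {
                        // collect all remotes as soon as possible
                    }
                    None => break, // all senders dropped
                },
            }
        }
    }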
For better code structure, the collection task is split into a separate
module.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
Changes since v1:
- get_settings_or_default function: use unwrap_or_default instead
of match
- Add `rand` crate dependency to d/control
- Refactor sleep helpers, consolidate into a single
parameterised function
- Get rid of .unwrap() when spawning the metric collection task
- Let trigger_metric_collection_for_remote take a String instead of
&str
Changes since v2:
- Use hardcoded default for max-concurrency for now. This
might be replaced by some global knob, affecting all background tasks
- Drop interval delays/jitter for now, this also might be affected
by a global request scheduler
server/src/bin/proxmox-datacenter-api/main.rs | 2 +-
.../src/metric_collection/collection_task.rs | 216 ++++++++++++++++++
server/src/metric_collection/mod.rs | 145 +++++-------
3 files changed, 271 insertions(+), 92 deletions(-)
create mode 100644 server/src/metric_collection/collection_task.rs
diff --git a/server/src/bin/proxmox-datacenter-api/main.rs b/server/src/bin/proxmox-datacenter-api/main.rs
index 42bc0e1e..c0f0e3a4 100644
--- a/server/src/bin/proxmox-datacenter-api/main.rs
+++ b/server/src/bin/proxmox-datacenter-api/main.rs
@@ -373,7 +373,7 @@ async fn run(debug: bool) -> Result<(), Error> {
});
start_task_scheduler();
- metric_collection::start_task();
+ metric_collection::start_task()?;
tasks::remote_node_mapping::start_task();
resource_cache::start_task();
tasks::remote_tasks::start_task()?;
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
new file mode 100644
index 00000000..a7420edb
--- /dev/null
+++ b/server/src/metric_collection/collection_task.rs
@@ -0,0 +1,216 @@
+use std::{collections::HashMap, sync::Arc, time::Duration};
+
+use anyhow::Error;
+use tokio::{
+ sync::{
+ mpsc::{Receiver, Sender},
+ OwnedSemaphorePermit, Semaphore,
+ },
+ time::Interval,
+};
+
+use proxmox_section_config::typed::SectionConfigData;
+
+use pdm_api_types::remotes::{Remote, RemoteType};
+
+use crate::{connection, task_utils};
+
+use super::rrd_task::RrdStoreRequest;
+
+pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
+
+/// Default metric collection interval.
+pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+
+/// Control messages for the metric collection task.
+pub(super) enum ControlMsg {
+ CollectSingleRemote(String),
+ CollectAllRemotes,
+}
+
+/// Task which periodically collects metrics from all remotes and stores
+/// them in the local metrics database.
+pub(super) struct MetricCollectionTask {
+ most_recent_timestamps: HashMap<String, i64>,
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+}
+
+impl MetricCollectionTask {
+ /// Create a new metric collection task.
+ pub(super) fn new(
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+ ) -> Result<Self, Error> {
+ Ok(Self {
+ most_recent_timestamps: HashMap::new(),
+ metric_data_tx,
+ control_message_rx,
+ })
+ }
+
+ /// Run the metric collection task.
+ ///
+ /// This function never returns.
+ #[tracing::instrument(skip_all, name = "metric_collection_task")]
+ pub(super) async fn run(&mut self) {
+ let mut timer = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
+
+ log::debug!(
+ "metric collection starting up. Collection interval set to {} seconds.",
+ DEFAULT_COLLECTION_INTERVAL,
+ );
+
+ loop {
+ tokio::select! {
+ _ = timer.tick() => {
+ log::debug!("starting metric collection from all remotes - triggered by timer");
+
+ if let Some(remotes) = Self::load_remote_config() {
+ let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ }
+ }
+
+ val = self.control_message_rx.recv() => {
+ // Reload settings in case they have changed in the meanwhile
+ match val {
+ Some(ControlMsg::CollectSingleRemote(remote)) => {
+ if let Some(remotes) = Self::load_remote_config() {
+ log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
+ self.fetch_remotes(&remotes, &[remote]).await;
+ }
+ }
+ Some(ControlMsg::CollectAllRemotes) => {
+ if let Some(remotes) = Self::load_remote_config() {
+ log::debug!("starting metric collection from all remotes - triggered by control message");
+ let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ }
+ }
+ _ => {},
+ }
+ }
+ }
+ }
+ }
+
+ /// Set up a [`tokio::time::Interval`] instance with the provided interval.
+ /// The timer will be aligned, e.g. an interval of `60` will let the timer
+ /// fire at minute boundaries.
+ fn setup_timer(interval: u64) -> Interval {
+ let mut timer = tokio::time::interval(Duration::from_secs(interval));
+ let first_run = task_utils::next_aligned_instant(interval).into();
+ timer.reset_at(first_run);
+
+ timer
+ }
+
+ /// Convenience helper to load `remote.cfg`, logging the error
+ /// and returning `None` if the config could not be read.
+ fn load_remote_config() -> Option<SectionConfigData<Remote>> {
+ match pdm_config::remotes::config() {
+ Ok((remotes, _)) => Some(remotes),
+ Err(e) => {
+ log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
+ None
+ }
+ }
+ }
+
+ /// Fetch metric data from a provided list of remotes concurrently.
+ /// The maximum number of concurrent connections is determined by
+ /// the [`MAX_CONCURRENT_CONNECTIONS`] constant.
+ async fn fetch_remotes(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ remotes_to_fetch: &[String],
+ ) {
+ let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
+ let mut handles = Vec::new();
+
+ for remote_name in remotes_to_fetch {
+ let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+
+ // unwrap is okay here, acquire_* will only fail if `close` has been
+ // called on the semaphore.
+ let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
+
+ if let Some(remote) = remote_config.get(remote_name).cloned() {
+ log::debug!("fetching remote '{}'", remote.id);
+ let handle = tokio::spawn(Self::fetch_single_remote(
+ remote,
+ start_time,
+ self.metric_data_tx.clone(),
+ permit,
+ ));
+
+ handles.push((remote_name.clone(), handle));
+ }
+ }
+
+ for (remote_name, handle) in handles {
+ let res = handle.await;
+
+ match res {
+ Ok(Ok(ts)) => {
+ self.most_recent_timestamps
+ .insert(remote_name.to_string(), ts);
+ }
+ Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
+ Err(err) => {
+ log::error!(
+ "join error for metric collection task for remote {remote_name}: {err}"
+ )
+ }
+ }
+ }
+ }
+
+ /// Fetch a single remote.
+ #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
+ async fn fetch_single_remote(
+ remote: Remote,
+ start_time: i64,
+ sender: Sender<RrdStoreRequest>,
+ _permit: OwnedSemaphorePermit,
+ ) -> Result<i64, Error> {
+ let most_recent_timestamp = match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(Some(true), Some(false), Some(start_time))
+ .await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client.metrics(Some(true), Some(start_time)).await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ };
+
+ Ok(most_recent_timestamp)
+ }
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9c75f39c..9b203615 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,20 +1,17 @@
-use std::collections::HashMap;
use std::pin::pin;
+use std::sync::OnceLock;
-use anyhow::Error;
+use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
-use pdm_api_types::remotes::RemoteType;
-
-use crate::{connection, task_utils};
-
+mod collection_task;
pub mod rrd_cache;
mod rrd_task;
pub mod top_entities;
-use rrd_task::RrdStoreRequest;
+use collection_task::{ControlMsg, MetricCollectionTask};
-const COLLECTION_INTERVAL: u64 = 60;
+static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
@@ -24,89 +21,55 @@ pub fn init() -> Result<(), Error> {
}
/// Start the metric collection task.
-pub fn start_task() {
- let (tx, rx) = mpsc::channel(128);
+pub fn start_task() -> Result<(), Error> {
+ let (metric_data_tx, metric_data_rx) = mpsc::channel(128);
- tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task(tx));
- let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
- });
-
- tokio::spawn(async move {
- let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
- let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
- });
-}
-
-async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
- let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
-
- loop {
- let delay_target = task_utils::next_aligned_instant(COLLECTION_INTERVAL);
- tokio::time::sleep_until(tokio::time::Instant::from_std(delay_target)).await;
-
- let remotes = match pdm_config::remotes::config() {
- Ok((remotes, _)) => remotes,
- Err(e) => {
- log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
- continue;
- }
- };
-
- for (remote_name, remote) in remotes {
- let start_time = *most_recent_timestamps.get(&remote_name).unwrap_or(&0);
- let remote_name_clone = remote_name.clone();
-
- let res = async {
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(&remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(&remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- }?;
-
- Ok::<i64, Error>(most_recent_timestamp)
- }
- .await;
-
- match res {
- Ok(ts) => {
- most_recent_timestamps.insert(remote_name.to_string(), ts);
- }
- Err(err) => log::error!("failed to collect metrics for {remote_name}: {err}"),
- }
- }
+ let (trigger_collection_tx, trigger_collection_rx) = mpsc::channel(128);
+ if CONTROL_MESSAGE_TX.set(trigger_collection_tx).is_err() {
+ bail!("control message sender alread set");
}
+
+ tokio::spawn(async move {
+ let metric_collection_task_future = pin!(async move {
+ match MetricCollectionTask::new(metric_data_tx, trigger_collection_rx) {
+ Ok(mut task) => task.run().await,
+ Err(err) => log::error!("could not start metric collection task: {err}"),
+ }
+ });
+
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(metric_collection_task_future, abort_future).await;
+ });
+
+ tokio::spawn(async move {
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(rrd_task_future, abort_future).await;
+ });
+
+ Ok(())
+}
+
+/// Schedule metric collection for a given remote as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection_for_remote(remote: String) -> Result<(), Error> {
+ if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+ sender.send(ControlMsg::CollectSingleRemote(remote)).await?;
+ }
+
+ Ok(())
+}
+
+/// Schedule metric collection for all remotes as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection() -> Result<(), Error> {
+ if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+ sender.send(ControlMsg::CollectAllRemotes).await?;
+ }
+
+ Ok(())
}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 04/23] metric collection: persist state after metric collection
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (2 preceding siblings ...)
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 03/23] metric collection: rework metric poll task Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 05/23] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
` (21 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
The metric collection system has to maintain some state; at the moment this
is only the most recent timestamp from all received metric data points.
We use this as a cut-off time when requesting metrics from the remote, as
everything *older* than that should already be stored in the database.
Up until now, the state was only stored in memory, which means that it
was lost when the daemon restarted.
The changes in this commit make the metric collection system
load/save its state from a file in
/var/lib/proxmox-datacenter-manager/metric-collection-state.json
The following data points are stored for every remote:
- most-recent-datapoint:
Timestamp of the most recent datapoint, used as a cut-off when
fetching new metrics
- last-collection:
Timestamp of the last *successful* metric collection for the remote,
based on PDM's local time
- error:
String representation of any error that occurred in the last collection
attempt
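For illustration, here is a self-contained re-statement of the serde types
added in state.rs below (kebab-case field names); the remote name and
timestamps are made up, but the printed JSON shows the shape that ends up
in the state file:

    use std::collections::HashMap;

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize, Default)]
    #[serde(rename_all = "kebab-case")]
    struct RemoteStatus {
        most_recent_datapoint: i64,
        last_collection: Option<i64>,
        #[serde(skip_serializing_if = "Option::is_none")]
        error: Option<String>,
    }

    #[derive(Serialize, Deserialize, Default)]
    #[serde(rename_all = "kebab-case")]
    struct State {
        remote_status: HashMap<String, RemoteStatus>,
    }

    fn main() {
        let mut state = State::default();
        state.remote_status.insert(
            "some-remote".into(),
            RemoteStatus {
                most_recent_datapoint: 1734000000,
                last_collection: Some(1734000060),
                error: None,
            },
        );

        // prints something like:
        // {"remote-status":{"some-remote":{"most-recent-datapoint":1734000000,"last-collection":1734000060}}}
        println!("{}", serde_json::to_string(&state).unwrap());
    }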
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
Changes since v1:
- use .inspect_err(..).unwrap_or_default() instead of a manual match
in `MetricCollectionState::new`
.../src/metric_collection/collection_task.rs | 131 ++++++++++++------
server/src/metric_collection/mod.rs | 1 +
server/src/metric_collection/rrd_task.rs | 60 ++++++--
server/src/metric_collection/state.rs | 126 +++++++++++++++++
4 files changed, 268 insertions(+), 50 deletions(-)
create mode 100644 server/src/metric_collection/state.rs
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index a7420edb..c771f4d5 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,21 +1,31 @@
-use std::{collections::HashMap, sync::Arc, time::Duration};
+use std::{sync::Arc, time::Duration};
use anyhow::Error;
use tokio::{
sync::{
mpsc::{Receiver, Sender},
- OwnedSemaphorePermit, Semaphore,
+ oneshot, OwnedSemaphorePermit, Semaphore,
},
time::Interval,
};
use proxmox_section_config::typed::SectionConfigData;
+use proxmox_sys::fs::CreateOptions;
use pdm_api_types::remotes::{Remote, RemoteType};
use crate::{connection, task_utils};
-use super::rrd_task::RrdStoreRequest;
+use super::{
+ rrd_task::{RrdStoreRequest, RrdStoreResult},
+ state::{MetricCollectionState, RemoteStatus},
+};
+
+/// Location of the metric collection state file.
+const METRIC_COLLECTION_STATE_FILE: &str = concat!(
+ pdm_buildcfg::PDM_STATE_DIR_M!(),
+ "/metric-collection-state.json"
+);
pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
@@ -31,7 +41,7 @@ pub(super) enum ControlMsg {
/// Task which periodically collects metrics from all remotes and stores
/// them in the local metrics database.
pub(super) struct MetricCollectionTask {
- most_recent_timestamps: HashMap<String, i64>,
+ state: MetricCollectionState,
metric_data_tx: Sender<RrdStoreRequest>,
control_message_rx: Receiver<ControlMsg>,
}
@@ -42,8 +52,10 @@ impl MetricCollectionTask {
metric_data_tx: Sender<RrdStoreRequest>,
control_message_rx: Receiver<ControlMsg>,
) -> Result<Self, Error> {
+ let state = load_state()?;
+
Ok(Self {
- most_recent_timestamps: HashMap::new(),
+ state,
metric_data_tx,
control_message_rx,
})
@@ -92,6 +104,10 @@ impl MetricCollectionTask {
}
}
}
+
+ if let Err(err) = self.state.save() {
+ log::error!("could not update metric collection state: {err}");
+ }
}
}
@@ -131,7 +147,11 @@ impl MetricCollectionTask {
let mut handles = Vec::new();
for remote_name in remotes_to_fetch {
- let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+ let status = self
+ .state
+ .get_status(remote_name)
+ .cloned()
+ .unwrap_or_default();
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
@@ -141,7 +161,7 @@ impl MetricCollectionTask {
log::debug!("fetching remote '{}'", remote.id);
let handle = tokio::spawn(Self::fetch_single_remote(
remote,
- start_time,
+ status,
self.metric_data_tx.clone(),
permit,
));
@@ -155,8 +175,7 @@ impl MetricCollectionTask {
match res {
Ok(Ok(ts)) => {
- self.most_recent_timestamps
- .insert(remote_name.to_string(), ts);
+ self.state.set_status(remote_name, ts);
}
Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
@@ -172,45 +191,79 @@ impl MetricCollectionTask {
#[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
async fn fetch_single_remote(
remote: Remote,
- start_time: i64,
+ mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<i64, Error> {
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(&remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
+ ) -> Result<RemoteStatus, Error> {
+ let (result_tx, result_rx) = oneshot::channel();
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+ let now = proxmox_time::epoch_i64();
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
+ let res: Result<RrdStoreResult, Error> = async {
+ match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(
+ Some(true),
+ Some(false),
+ Some(status.most_recent_datapoint),
+ )
+ .await?;
- most_recent
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client
+ .metrics(Some(true), Some(status.most_recent_datapoint))
+ .await?;
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
}
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(&remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+ result_rx.await.map_err(Error::from)
+ }
+ .await;
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
-
- most_recent
+ match res {
+ Ok(result) => {
+ status.most_recent_datapoint = result.most_recent_timestamp;
+ status.last_collection = Some(now);
+ status.error = None;
}
- };
+ Err(err) => {
+ status.error = Some(err.to_string());
+ log::error!("coud not fetch metrics from '{}': {err}", remote.id);
+ }
+ }
- Ok(most_recent_timestamp)
+ Ok(status)
}
}
+
+/// Load the metric collection state file.
+pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
+
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+ Ok(MetricCollectionState::new(
+ METRIC_COLLECTION_STATE_FILE.into(),
+ file_options,
+ ))
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9b203615..9cd60455 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,7 @@ use tokio::sync::mpsc::{self, Sender};
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
+mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index a72945df..1c618f54 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,8 +1,10 @@
use anyhow::Error;
-use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use tokio::sync::{mpsc::Receiver, oneshot};
+
use proxmox_rrd::rrd::DataSourceType;
+
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use tokio::sync::mpsc::Receiver;
use super::rrd_cache;
@@ -14,6 +16,8 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: ClusterMetrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
/// Store PBS metrics.
Pbs {
@@ -21,9 +25,17 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: Metrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
}
+/// Result for a [`RrdStoreRequest`].
+pub(super) struct RrdStoreResult {
+ /// Most recent timestamp of any stored metric datapoint (UNIX epoch).
+ pub(super) most_recent_timestamp: i64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -31,17 +43,43 @@ pub(super) async fn store_in_rrd_task(
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
// Involves some blocking file IO
- let res = tokio::task::spawn_blocking(move || match msg {
- RrdStoreRequest::Pve { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pve(&remote, &data_point);
+ let res = tokio::task::spawn_blocking(move || {
+ let mut most_recent_timestamp = 0;
+ let channel = match msg {
+ RrdStoreRequest::Pve {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pve(&remote, &data_point);
+ }
+
+ channel
}
- }
- RrdStoreRequest::Pbs { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pbs(&remote, &data_point);
+ RrdStoreRequest::Pbs {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pbs(&remote, &data_point);
+ }
+
+ channel
}
- }
+ };
+
+ if channel
+ .send(RrdStoreResult {
+ most_recent_timestamp,
+ })
+ .is_err()
+ {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
})
.await;
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
new file mode 100644
index 00000000..a0bf10fe
--- /dev/null
+++ b/server/src/metric_collection/state.rs
@@ -0,0 +1,126 @@
+use std::{
+ collections::HashMap,
+ path::{Path, PathBuf},
+};
+
+use anyhow::Error;
+use serde::{Deserialize, Serialize};
+
+use proxmox_sys::fs::CreateOptions;
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Metric collection state file content.
+struct State {
+ remote_status: HashMap<String, RemoteStatus>,
+}
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// A remote's metric collection state.
+pub struct RemoteStatus {
+ /// Most recent datapoint - time stamp is based on remote time
+ pub most_recent_datapoint: i64,
+ /// Last successful metric collection - timestamp based on PDM's time
+ pub last_collection: Option<i64>,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ /// Any error that occurred during the last metric collection attempt.
+ pub error: Option<String>,
+}
+
+/// Manage and persist metric collection state.
+pub struct MetricCollectionState {
+ /// Path to the persisted state
+ path: PathBuf,
+ /// File owner/perms for the persisted state file
+ file_options: CreateOptions,
+ /// The current state
+ state: State,
+}
+
+impl MetricCollectionState {
+ /// Initialize state by trying to load the existing statefile. If the file does not exist,
+ /// state will be empty. If the file failed to load, state will be empty and
+ /// an error will be logged.
+ pub fn new(statefile: PathBuf, file_options: CreateOptions) -> Self {
+ let state = Self::load_or_default(&statefile)
+ .inspect_err(|err| {
+ log::error!("could not load metric collection state: {err}");
+ })
+ .unwrap_or_default();
+
+ Self {
+ path: statefile,
+ file_options,
+ state,
+ }
+ }
+
+ /// Set a remote's status.
+ pub fn set_status(&mut self, remote: String, remote_state: RemoteStatus) {
+ self.state.remote_status.insert(remote, remote_state);
+ }
+
+ /// Get a remote's status.
+ pub fn get_status(&self, remote: &str) -> Option<&RemoteStatus> {
+ self.state.remote_status.get(remote)
+ }
+
+ /// Persist the state to the statefile.
+ pub fn save(&self) -> Result<(), Error> {
+ let data = serde_json::to_vec_pretty(&self.state)?;
+ proxmox_sys::fs::replace_file(&self.path, &data, self.file_options, true)?;
+
+ Ok(())
+ }
+
+ fn load_or_default(path: &Path) -> Result<State, Error> {
+ let content = proxmox_sys::fs::file_read_optional_string(path)?;
+
+ if let Some(content) = content {
+ Ok(serde_json::from_str(&content)?)
+ } else {
+ Ok(Default::default())
+ }
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_sys::fs::CreateOptions;
+
+ use super::*;
+
+ use crate::test_support::temp::NamedTempFile;
+
+ fn get_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ #[test]
+ fn save_and_load() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_options())?;
+ let options = get_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+ state.set_status(
+ "some-remote".into(),
+ RemoteStatus {
+ most_recent_datapoint: 1234,
+ ..Default::default()
+ },
+ );
+
+ state.save()?;
+
+ let state = MetricCollectionState::new(file.path().into(), options);
+
+ let status = state.get_status("some-remote").unwrap();
+ assert_eq!(status.most_recent_datapoint, 1234);
+
+ Ok(())
+ }
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 05/23] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (3 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 04/23] metric collection: persist state after metric collection Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 06/23] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
` (20 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
It makes sense to limit the collection frequency per remote, especially
since we will have the ability to trigger collection manually.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/metric_collection/collection_task.rs | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index c771f4d5..27e097f7 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -31,6 +31,8 @@ pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
/// Default metric collection interval.
pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+/// Minimum metric collection interval.
+pub const MIN_COLLECTION_INTERVAL: u64 = 10;
/// Control messages for the metric collection task.
pub(super) enum ControlMsg {
@@ -145,6 +147,7 @@ impl MetricCollectionTask {
) {
let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
let mut handles = Vec::new();
+ let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
let status = self
@@ -153,6 +156,13 @@ impl MetricCollectionTask {
.cloned()
.unwrap_or_default();
+ if now - status.last_collection.unwrap_or(0) < MIN_COLLECTION_INTERVAL as i64 {
+ log::debug!(
+ "skipping metric collection for remote '{remote_name}' - data is recent enough"
+ );
+ continue;
+ }
+
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 06/23] metric collection: collect overdue metrics on startup/timer change
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (4 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 05/23] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 07/23] metric collection: add tests for the fetch_remotes function Lukas Wagner
` (19 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
Because the timer fires at aligned points in time and might not fire right
away after being set up, gaps in the data can occur at daemon startup or when
the timer interval changes.
To mitigate this, on daemon startup and also if the collection interval
changes, we
- check if the time until the next scheduled regular collection
plus the time *since* the last successful collection exceeds
the collection interval
- if yes, we collect immediately
- if no, we do nothing and let the remote be collected at the
next timer tick
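As a rough sketch (illustrative only, with made-up numbers - the actual
implementation is the fetch_overdue() function in the diff below):

    /// Sketch of the overdue check; all values are in seconds.
    fn is_overdue(now: i64, last_collection: i64, until_next_tick: i64, interval: i64) -> bool {
        // e.g. interval = 600, last collection 500s ago, next tick in 150s:
        // 500 + 150 > 600, so waiting for the tick would leave a gap and
        // the remote is collected right away.
        (now - last_collection) + until_next_tick > interval
    }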
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
Changes since v1:
- Document return values of `setup_timer`
.../src/metric_collection/collection_task.rs | 57 +++++++++++++++++--
1 file changed, 51 insertions(+), 6 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 27e097f7..1821f138 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,4 +1,7 @@
-use std::{sync::Arc, time::Duration};
+use std::{
+ sync::Arc,
+ time::{Duration, Instant},
+};
use anyhow::Error;
use tokio::{
@@ -68,12 +71,17 @@ impl MetricCollectionTask {
/// This function never returns.
#[tracing::instrument(skip_all, name = "metric_collection_task")]
pub(super) async fn run(&mut self) {
- let mut timer = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
+ let (mut timer, first_tick) = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
log::debug!(
"metric collection starting up. Collection interval set to {} seconds.",
DEFAULT_COLLECTION_INTERVAL,
);
+ // Check and fetch any remote which would be overdue by the time the
+ // timer first fires.
+ if let Some(remote_config) = Self::load_remote_config() {
+ self.fetch_overdue(&remote_config, first_tick).await;
+ }
loop {
tokio::select! {
@@ -116,12 +124,16 @@ impl MetricCollectionTask {
/// Set up a [`tokio::time::Interval`] instance with the provided interval.
/// The timer will be aligned, e.g. an interval of `60` will let the timer
/// fire at minute boundaries.
- fn setup_timer(interval: u64) -> Interval {
+ ///
+ /// The return values are a tuple of the [`tokio::time::Interval`] timer instance
+ /// and the [`std::time::Instant`] at which the timer first fires.
+ fn setup_timer(interval: u64) -> (Interval, Instant) {
+ log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
- let first_run = task_utils::next_aligned_instant(interval).into();
- timer.reset_at(first_run);
+ let first_run = task_utils::next_aligned_instant(interval);
+ timer.reset_at(first_run.into());
- timer
+ (timer, first_run)
}
/// Convenience helper to load `remote.cfg`, logging the error
@@ -197,6 +209,39 @@ impl MetricCollectionTask {
}
}
+ /// Fetch metric data from remotes which are overdue for collection.
+ ///
+ /// Use this on startup of the metric collection loop as well as
+ /// when the collection interval changes.
+ async fn fetch_overdue(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ next_run: Instant,
+ ) {
+ let left_until_scheduled = next_run - Instant::now();
+ let now = proxmox_time::epoch_i64();
+
+ let mut overdue = Vec::new();
+
+ for (remote, _) in remote_config.iter() {
+ let last_collection = self
+ .state
+ .get_status(remote)
+ .and_then(|s| s.last_collection)
+ .unwrap_or(0);
+
+ let diff = now - last_collection;
+
+ if diff + left_until_scheduled.as_secs() as i64 > DEFAULT_COLLECTION_INTERVAL as i64 {
+ log::debug!(
+ "starting metric collection for remote '{remote}' - triggered because collection is overdue"
+ );
+ overdue.push(remote.into());
+ }
+ }
+ self.fetch_remotes(remote_config, &overdue).await;
+ }
+
/// Fetch a single remote.
#[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
async fn fetch_single_remote(
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 07/23] metric collection: add tests for the fetch_remotes function
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (5 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 06/23] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 08/23] metric collection: add test for fetch_overdue Lukas Wagner
` (18 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
Since we can now use mocked remote clients and a mocked remote config, and
since the metric collection and RRD tasks are decoupled, it is viable to
write a unit test for one of the core pieces of the metric collection system.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 226 ++++++++++++++++++
server/src/metric_collection/state.rs | 18 +-
2 files changed, 231 insertions(+), 13 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 1821f138..35f85c79 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -322,3 +322,229 @@ pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
file_options,
))
}
+
+#[cfg(test)]
+pub(super) mod tests {
+ use std::sync::Once;
+
+ use anyhow::bail;
+ use http::StatusCode;
+
+ use pdm_api_types::Authid;
+ use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ connection::{ClientFactory, PveClient},
+ metric_collection::rrd_task::RrdStoreResult,
+ pbs_client::PbsClient,
+ test_support::temp::NamedTempFile,
+ };
+
+ use super::*;
+
+ pub(crate) fn get_create_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ struct TestClientFactory {
+ now: i64,
+ }
+
+ #[async_trait::async_trait]
+ impl ClientFactory for TestClientFactory {
+ fn make_pve_client(&self, remote: &Remote) -> Result<Arc<PveClient>, Error> {
+ Ok(Arc::new(TestPveClient {
+ fail: remote.id.contains("fail"),
+ now: self.now,
+ }))
+ }
+ /// Create a new API client for PVE remotes, but with a specific endpoint.
+ fn make_pve_client_with_endpoint(
+ &self,
+ _remote: &Remote,
+ _target_endpoint: Option<&str>,
+ ) -> Result<Arc<PveClient>, Error> {
+ bail!("not implemented")
+ }
+
+ fn make_pbs_client(&self, _remote: &Remote) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pve_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Arc<PveClient>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pbs_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+ }
+
+ struct TestPveClient {
+ now: i64,
+ fail: bool,
+ }
+
+ #[async_trait::async_trait]
+ impl pve_api_types::client::PveClient for TestPveClient {
+ /// Retrieve metrics of the cluster.
+ async fn cluster_metrics_export(
+ &self,
+ _history: Option<bool>,
+ _local_only: Option<bool>,
+ start_time: Option<i64>,
+ ) -> Result<ClusterMetrics, proxmox_client::Error> {
+ if self.fail {
+ return Err(proxmox_client::Error::Api(
+ StatusCode::INTERNAL_SERVER_ERROR,
+ "internal server error".into(),
+ ));
+ }
+
+ let mut time = start_time.unwrap_or(0);
+ time = time.max(self.now - (30 * 60));
+ let mut data = Vec::new();
+
+ use pve_api_types::ClusterMetricsDataType::*;
+
+ while time < self.now {
+ let point = |id: &str, metric: &str, timestamp, ty| ClusterMetricsData {
+ id: id.into(),
+ metric: metric.into(),
+ timestamp,
+ ty,
+ value: 0.1,
+ };
+
+ for i in 0..5 {
+ let id = format!("node/node-{i}");
+ data.push(point(&id, "cpu_current", time, Gauge));
+ }
+
+ // Advance time by 10 seconds
+ time += 10;
+ }
+
+ Ok(ClusterMetrics { data })
+ }
+ }
+
+ fn make_remote_config() -> SectionConfigData<Remote> {
+ let mut sections = SectionConfigData::default();
+
+ for i in 0..4 {
+ let status = if i >= 2 { "fail" } else { "pass" };
+ let name = format!("pve-{i}-{status}");
+
+ sections.insert(
+ name.clone(),
+ Remote {
+ ty: pdm_api_types::remotes::RemoteType::Pve,
+ id: name.clone(),
+ nodes: Vec::new(),
+ authid: Authid::root_auth_id().clone(),
+ token: "".into(),
+ web_url: None,
+ },
+ );
+ }
+
+ sections
+ }
+
+ async fn fake_rrd_task(mut rx: Receiver<RrdStoreRequest>) -> u32 {
+ let mut number_of_requests = 0;
+
+ while let Some(request) = rx.recv().await {
+ number_of_requests += 1;
+
+ if let RrdStoreRequest::Pve {
+ metrics, channel, ..
+ } = request
+ {
+ let most_recent_timestamp =
+ metrics.data.iter().map(|e| e.timestamp).max().unwrap_or(0);
+
+ let _ = channel.send(RrdStoreResult {
+ most_recent_timestamp,
+ });
+ }
+ }
+
+ number_of_requests
+ }
+
+ static START: Once = Once::new();
+
+ fn test_init() -> i64 {
+ let now = 10000;
+ START.call_once(|| {
+ // TODO: the client factory is currently stored in a OnceLock -
+ // we can only set it from one test... Ideally we'd like to have the
+ // option to set it in every single test if needed - task/thread local?
+ connection::init(Box::new(TestClientFactory { now }));
+ });
+
+ now
+ }
+
+ #[tokio::test]
+ async fn test_fetch_remotes_updates_state() {
+ // Arrange
+ let now = test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ // Act
+ let to_fetch = config
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ task.fetch_remotes(&config, &to_fetch).await;
+
+ // Assert
+ for remote in &to_fetch {
+ let status = task.state.get_status(remote).unwrap();
+
+ // Our faked PVE client will return an error if the remote name contains
+ // 'fail'.
+ if remote.contains("fail") {
+ assert!(status
+ .error
+ .as_ref()
+ .unwrap()
+ .contains("internal server error"));
+ assert_eq!(status.last_collection, None);
+ } else {
+ assert!(now - status.most_recent_datapoint <= 10);
+ assert!(status.error.is_none());
+ }
+ }
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 2);
+ }
+}
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index a0bf10fe..e76444a9 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -87,24 +87,16 @@ impl MetricCollectionState {
#[cfg(test)]
mod tests {
- use proxmox_sys::fs::CreateOptions;
+ use crate::metric_collection::collection_task::tests::get_create_options;
+ use crate::test_support::temp::NamedTempFile;
use super::*;
- use crate::test_support::temp::NamedTempFile;
-
- fn get_options() -> CreateOptions {
- CreateOptions::new()
- .owner(nix::unistd::Uid::effective())
- .group(nix::unistd::Gid::effective())
- .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
- }
-
#[test]
fn save_and_load() -> Result<(), Error> {
- let file = NamedTempFile::new(get_options())?;
- let options = get_options();
- let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options);
state.set_status(
"some-remote".into(),
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 08/23] metric collection: add test for fetch_overdue
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (6 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 07/23] metric collection: add tests for the fetch_remotes function Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 09/23] metric collection: pass rrd cache instance as function parameter Lukas Wagner
` (17 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This test ensures that the logic for fetching metrics from remotes which are
overdue for collection continues to work as expected.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 62 ++++++++++++++++++-
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 35f85c79..f2fe42a4 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -80,7 +80,8 @@ impl MetricCollectionTask {
// Check and fetch any remote which would be overdue by the time the
// timer first fires.
if let Some(remote_config) = Self::load_remote_config() {
- self.fetch_overdue(&remote_config, first_tick).await;
+ self.fetch_overdue(&remote_config, first_tick, DEFAULT_COLLECTION_INTERVAL)
+ .await;
}
loop {
@@ -217,6 +218,7 @@ impl MetricCollectionTask {
&mut self,
remote_config: &SectionConfigData<Remote>,
next_run: Instant,
+ collection_interval: u64,
) {
let left_until_scheduled = next_run - Instant::now();
let now = proxmox_time::epoch_i64();
@@ -232,7 +234,7 @@ impl MetricCollectionTask {
let diff = now - last_collection;
- if diff + left_until_scheduled.as_secs() as i64 > DEFAULT_COLLECTION_INTERVAL as i64 {
+ if diff + left_until_scheduled.as_secs() as i64 > collection_interval as i64 {
log::debug!(
"starting metric collection for remote '{remote}' - triggered because collection is overdue"
);
@@ -547,4 +549,60 @@ pub(super) mod tests {
drop(task);
assert_eq!(handle.await.unwrap(), 2);
}
+
+ #[tokio::test]
+ async fn test_fetch_overdue() {
+ // Arrange
+ test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let mut state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let now = proxmox_time::epoch_i64();
+
+ // This one should be fetched
+ state.set_status(
+ "pve-0-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 35),
+ ..Default::default()
+ },
+ );
+ // This one should *not* be fetched
+ state.set_status(
+ "pve-1-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 25),
+ ..Default::default()
+ },
+ );
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ let next_collection = Instant::now() + Duration::from_secs(30);
+
+ // Act
+ task.fetch_overdue(&config, next_collection, 60).await;
+
+ // Assert
+ let status = task.state.get_status("pve-0-pass").unwrap();
+ assert!(status.last_collection.unwrap() - now >= 0);
+
+ let status = task.state.get_status("pve-1-pass").unwrap();
+ assert_eq!(status.last_collection.unwrap(), now - 25);
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 1);
+ }
}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 09/23] metric collection: pass rrd cache instance as function parameter
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (7 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 08/23] metric collection: add test for fetch_overdue Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 10/23] metric collection: add test for rrd task Lukas Wagner
` (16 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This enables dependency injection for tests: the RRD cache to use is passed
in, instead of having to replace the global static instance. The latter is
always awkward because tests may run multi-threaded, so the global instance
would have to be a thread-local; that in turn is a poor fit for tokio, where
we would actually want task-local variables, which again break down as soon
as new tasks are spawned and no longer see their parent task's task-locals.
Long story short, passing the dependency in as a parameter makes things
easier.
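For illustration, a test can now inject its own cache instance roughly like
this (sketch only, simplified from the rrd_task test added later in this
series; NamedTempDir and get_create_options are the test helpers used there):

    #[tokio::test]
    async fn store_task_smoke_sketch() -> Result<(), Error> {
        let dir = NamedTempDir::new()?;
        let options = get_create_options();
        let cache = rrd_cache::init(dir.path(), options, options)?;

        let (tx, rx) = tokio::sync::mpsc::channel(10);
        let handle = tokio::task::spawn(store_in_rrd_task(Arc::clone(&cache), rx));

        // ... send RrdStoreRequests through `tx`, then drop the sender so
        // the task terminates and can be awaited.
        drop(tx);
        handle.await??;
        Ok(())
    }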
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/api/rrd_common.rs | 4 +-
server/src/metric_collection/mod.rs | 15 +++-
server/src/metric_collection/rrd_cache.rs | 77 ++++++++++++--------
server/src/metric_collection/rrd_task.rs | 16 ++--
server/src/metric_collection/top_entities.rs | 3 +
5 files changed, 75 insertions(+), 40 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index 0d82d0c3..d9ed017a 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -23,9 +23,11 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let mut timemap = BTreeMap::new();
let mut last_resolution = None;
+ let cache = rrd_cache::get_cache();
+
for name in T::fields() {
let (start, resolution, data) =
- match rrd_cache::extract_data(basedir, name, timeframe, mode)? {
+ match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
Some(data) => data.into(),
None => continue,
};
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9cd60455..509d4f88 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -4,6 +4,8 @@ use std::sync::OnceLock;
use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
+use proxmox_sys::fs::CreateOptions;
+
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
@@ -16,7 +18,14 @@ static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
- rrd_cache::init()?;
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
+
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+ let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+ let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(cache)?;
Ok(())
}
@@ -42,8 +51,10 @@ pub fn start_task() -> Result<(), Error> {
futures::future::select(metric_collection_task_future, abort_future).await;
});
+ let cache = rrd_cache::get_cache();
+
tokio::spawn(async move {
- let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(cache, metric_data_rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(rrd_task_future, abort_future).await;
});
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index a8d72b80..70c91ca6 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -5,6 +5,7 @@
//! and update RRD data inside `proxmox-datacenter-api`.
use std::path::Path;
+use std::sync::Arc;
use anyhow::{format_err, Error};
use once_cell::sync::OnceCell;
@@ -16,32 +17,45 @@ use proxmox_sys::fs::CreateOptions;
use pdm_buildcfg::PDM_STATE_DIR_M;
-const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
+pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
-static RRD_CACHE: OnceCell<Cache> = OnceCell::new();
+// This is an `Arc` because this makes it easier to do dependency injection
+// in test contexts.
+//
+// For DI in testing, we want to pass in a reference to the Cache
+// as a function parameter. In a couple of these functions we
+// spawn tokio tasks which need access to the reference, hence the
+// reference needs to be 'static. In a test context, we kind of have a
+// hard time coming up with a 'static reference, so we just
+// wrap the cache in an `Arc` for now, solving the
+// lifetime problem via refcounting.
+static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
/// Get the RRD cache instance
-fn get_cache() -> Result<&'static Cache, Error> {
+pub fn get_cache() -> Arc<Cache> {
+ RRD_CACHE.get().cloned().expect("rrd cache not initialized")
+}
+
+pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
RRD_CACHE
- .get()
- .ok_or_else(|| format_err!("RRD cache not initialized!"))
+ .set(cache)
+ .map_err(|_| format_err!("RRD cache already initialized!"))?;
+
+ Ok(())
}
/// Initialize the RRD cache instance
///
/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init() -> Result<&'static Cache, Error> {
- let api_uid = pdm_config::api_user()?.uid;
- let api_gid = pdm_config::api_group()?.gid;
-
- let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
- let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
+pub fn init<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+) -> Result<Arc<Cache>, Error> {
let apply_interval = 30.0 * 60.0; // 30 minutes
let cache = Cache::new(
- RRD_CACHE_BASEDIR,
+ base_path,
Some(file_options),
Some(dir_options),
apply_interval,
@@ -51,11 +65,7 @@ pub fn init() -> Result<&'static Cache, Error> {
cache.apply_journal()?;
- RRD_CACHE
- .set(cache)
- .map_err(|_| format_err!("RRD cache already initialized!"))?;
-
- Ok(RRD_CACHE.get().unwrap())
+ Ok(Arc::new(cache))
}
fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
@@ -91,6 +101,7 @@ fn create_callback(dst: DataSourceType) -> Database {
/// Extracts data for the specified time frame from RRD cache
pub fn extract_data(
+ rrd_cache: &Cache,
basedir: &str,
name: &str,
timeframe: RrdTimeframe,
@@ -112,26 +123,28 @@ pub fn extract_data(
RrdMode::Average => AggregationFn::Average,
};
- let rrd_cache = get_cache()?;
-
rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
}
/// Sync/Flush the RRD journal
pub fn sync_journal() {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
- }
+ let rrd_cache = get_cache();
+ if let Err(err) = rrd_cache.sync_journal() {
+ log::error!("rrd_sync_journal failed - {err}");
}
}
+
/// Update RRD Gauge values
-pub fn update_value(name: &str, value: f64, timestamp: i64, datasource_type: DataSourceType) {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
- }
+pub fn update_value(
+ rrd_cache: &Cache,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+) {
+ if let Err(err) =
+ rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 1c618f54..3704b0e7 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,7 +1,9 @@
+use std::sync::Arc;
+
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::rrd::DataSourceType;
+use proxmox_rrd::{rrd::DataSourceType, Cache};
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
@@ -39,9 +41,11 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
+ cache: Arc<Cache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
+ let cache_clone = Arc::clone(&cache);
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
@@ -53,7 +57,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote, &data_point);
+ store_metric_pve(&cache_clone, &remote, &data_point);
}
channel
@@ -65,7 +69,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote, &data_point);
+ store_metric_pbs(&cache_clone, &remote, &data_point);
}
channel
@@ -91,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -105,6 +109,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
@@ -112,7 +117,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
);
}
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -126,6 +131,7 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f8e053fb..f54ee72f 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -121,7 +121,10 @@ fn get_entity(
name: String,
metric: &str,
) -> Option<(f64, TopEntity)> {
+ let cache = rrd_cache::get_cache();
+
if let Ok(Some(values)) = rrd_cache::extract_data(
+ &cache,
&name,
metric,
timeframe,
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 10/23] metric collection: add test for rrd task
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (8 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 09/23] metric collection: pass rrd cache instance as function parameter Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 11/23] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
` (15 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This commit adds a quick smoke test for the `store_in_rrd_task` function.
Unfortunately, due to a race condition in proxmox_rrd, some extra measures
had to be taken to keep the test from being flaky.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/metric_collection/rrd_task.rs | 92 ++++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 3704b0e7..27327a29 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -138,3 +138,95 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
data_source_type,
);
}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
+ use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ metric_collection::collection_task::tests::get_create_options,
+ test_support::temp::NamedTempDir,
+ };
+
+ use super::*;
+
+ #[tokio::test]
+ async fn test_rrd_task_persists_data() -> Result<(), Error> {
+ // Arrange
+ let dir = NamedTempDir::new()?;
+ let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
+ let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let task = store_in_rrd_task(Arc::clone(&cache), rx);
+ let handle = tokio::task::spawn(task);
+
+ let now = proxmox_time::epoch_i64();
+
+ let metrics = ClusterMetrics {
+ data: vec![
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 30,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 20,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 10,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ],
+ };
+ let (tx_back, rx_back) = tokio::sync::oneshot::channel();
+ let request = RrdStoreRequest::Pve {
+ remote: "some-remote".into(),
+ metrics,
+ channel: tx_back,
+ };
+
+ // Act
+ tx.send(request).await?;
+ let result = rx_back.await?;
+
+ // Assert
+ assert_eq!(result.most_recent_timestamp, now);
+
+ drop(tx);
+ handle.await??;
+
+ // There is some race condition in proxmox_rrd, in some rare cases
+ // extract_data does not return any data directly after writing.
+ if let Some(data) = rrd_cache::extract_data(
+ &cache,
+ "pve/some-remote/node/some-node",
+ "cpu_current",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ // Only assert that there are some data points, the exact position in the vec
+ // might vary due to changed boundaries.
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
+ Ok(())
+ }
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 11/23] metric collection: wrap rrd_cache::Cache in a struct
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (9 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 10/23] metric collection: add test for rrd task Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 12/23] metric collection: record remote response time in metric database Lukas Wagner
` (14 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
All helper functions for storing/retrieving data take the cache instance as
their first parameter. This smells like they should be methods on a struct.
This commit wraps the foreign `proxmox_rrd::Cache` type in a new type,
transforms the `init` function into a `new` method and all other helpers
into further methods.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/api/rrd_common.rs | 9 +-
server/src/metric_collection/mod.rs | 6 +-
server/src/metric_collection/rrd_cache.rs | 201 +++++++++----------
server/src/metric_collection/rrd_task.rs | 21 +-
server/src/metric_collection/top_entities.rs | 3 +-
5 files changed, 118 insertions(+), 122 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index d9ed017a..28868bc1 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -26,11 +26,10 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let cache = rrd_cache::get_cache();
for name in T::fields() {
- let (start, resolution, data) =
- match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
- Some(data) => data.into(),
- None => continue,
- };
+ let (start, resolution, data) = match cache.extract_data(basedir, name, timeframe, mode)? {
+ Some(data) => data.into(),
+ None => continue,
+ };
if let Some(expected_resolution) = last_resolution {
if resolution != expected_resolution {
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 509d4f88..5b6c14d2 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,4 +1,5 @@
use std::pin::pin;
+use std::sync::Arc;
use std::sync::OnceLock;
use anyhow::{bail, Error};
@@ -13,6 +14,7 @@ mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
+use rrd_cache::RrdCache;
static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
@@ -24,8 +26,8 @@ pub fn init() -> Result<(), Error> {
let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
- let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
- rrd_cache::set_cache(cache)?;
+ let cache = RrdCache::new(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(Arc::new(cache))?;
Ok(())
}
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index 70c91ca6..e8994143 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -29,14 +29,14 @@ pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
// hard time to come up with a 'static reference, so we just
// wrap the cache in an `Arc` for now, solving the
// lifetime problem via refcounting.
-static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
+static RRD_CACHE: OnceCell<Arc<RrdCache>> = OnceCell::new();
/// Get the RRD cache instance
-pub fn get_cache() -> Arc<Cache> {
+pub fn get_cache() -> Arc<RrdCache> {
RRD_CACHE.get().cloned().expect("rrd cache not initialized")
}
-pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
+pub fn set_cache(cache: Arc<RrdCache>) -> Result<(), Error> {
RRD_CACHE
.set(cache)
.map_err(|_| format_err!("RRD cache already initialized!"))?;
@@ -44,107 +44,106 @@ pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
Ok(())
}
-/// Initialize the RRD cache instance
-///
-/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init<P: AsRef<Path>>(
- base_path: P,
- dir_options: CreateOptions,
- file_options: CreateOptions,
-) -> Result<Arc<Cache>, Error> {
- let apply_interval = 30.0 * 60.0; // 30 minutes
-
- let cache = Cache::new(
- base_path,
- Some(file_options),
- Some(dir_options),
- apply_interval,
- load_callback,
- create_callback,
- )?;
-
- cache.apply_journal()?;
-
- Ok(Arc::new(cache))
+/// Wrapper for proxmox_rrd::Cache to accommodate helper methods.
+pub struct RrdCache {
+ cache: Cache,
}
-fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
- match Database::load(path, true) {
- Ok(rrd) => Some(rrd),
- Err(err) => {
- if err.kind() != std::io::ErrorKind::NotFound {
- log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+impl RrdCache {
+ /// Create a new RrdCache instance
+ pub fn new<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+ ) -> Result<Self, Error> {
+ let apply_interval = 30.0 * 60.0; // 30 minutes
+
+ let cache = Cache::new(
+ base_path,
+ Some(file_options),
+ Some(dir_options),
+ apply_interval,
+ Self::load_callback,
+ Self::create_callback,
+ )?;
+
+ cache.apply_journal()?;
+
+ Ok(Self { cache })
+ }
+
+ fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
+ match Database::load(path, true) {
+ Ok(rrd) => Some(rrd),
+ Err(err) => {
+ if err.kind() != std::io::ErrorKind::NotFound {
+ log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+ }
+ None
}
- None
+ }
+ }
+
+ fn create_callback(dst: DataSourceType) -> Database {
+ let rra_list = vec![
+ // 1 min * 1440 => 1 day
+ Archive::new(AggregationFn::Average, 60, 1440),
+ Archive::new(AggregationFn::Maximum, 60, 1440),
+ // 30 min * 1440 => 30 days ~ 1 month
+ Archive::new(AggregationFn::Average, 30 * 60, 1440),
+ Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
+ // 6 h * 1440 => 360 days ~ 1 year
+ Archive::new(AggregationFn::Average, 6 * 3600, 1440),
+ Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
+ // 1 week * 570 => 10 years
+ Archive::new(AggregationFn::Average, 7 * 86400, 570),
+ Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
+ ];
+
+ Database::new(dst, rra_list)
+ }
+
+ /// Extracts data for the specified time frame from RRD cache
+ pub fn extract_data(
+ &self,
+ basedir: &str,
+ name: &str,
+ timeframe: RrdTimeframe,
+ mode: RrdMode,
+ ) -> Result<Option<proxmox_rrd::Entry>, Error> {
+ let end = proxmox_time::epoch_f64() as u64;
+
+ let (start, resolution) = match timeframe {
+ RrdTimeframe::Hour => (end - 3600, 60),
+ RrdTimeframe::Day => (end - 3600 * 24, 60),
+ RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
+ RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
+ RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
+ RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
+ };
+
+ let cf = match mode {
+ RrdMode::Max => AggregationFn::Maximum,
+ RrdMode::Average => AggregationFn::Average,
+ };
+
+ self.cache
+ .extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
+ }
+
+ /// Update RRD Gauge values
+ pub fn update_value(
+ &self,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+ ) {
+ if let Err(err) =
+ self.cache
+ .update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
}
}
}
-
-fn create_callback(dst: DataSourceType) -> Database {
- let rra_list = vec![
- // 1 min * 1440 => 1 day
- Archive::new(AggregationFn::Average, 60, 1440),
- Archive::new(AggregationFn::Maximum, 60, 1440),
- // 30 min * 1440 => 30 days ~ 1 month
- Archive::new(AggregationFn::Average, 30 * 60, 1440),
- Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
- // 6 h * 1440 => 360 days ~ 1 year
- Archive::new(AggregationFn::Average, 6 * 3600, 1440),
- Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
- // 1 week * 570 => 10 years
- Archive::new(AggregationFn::Average, 7 * 86400, 570),
- Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
- ];
-
- Database::new(dst, rra_list)
-}
-
-/// Extracts data for the specified time frame from RRD cache
-pub fn extract_data(
- rrd_cache: &Cache,
- basedir: &str,
- name: &str,
- timeframe: RrdTimeframe,
- mode: RrdMode,
-) -> Result<Option<proxmox_rrd::Entry>, Error> {
- let end = proxmox_time::epoch_f64() as u64;
-
- let (start, resolution) = match timeframe {
- RrdTimeframe::Hour => (end - 3600, 60),
- RrdTimeframe::Day => (end - 3600 * 24, 60),
- RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
- RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
- RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
- RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
- };
-
- let cf = match mode {
- RrdMode::Max => AggregationFn::Maximum,
- RrdMode::Average => AggregationFn::Average,
- };
-
- rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
-}
-
-/// Sync/Flush the RRD journal
-pub fn sync_journal() {
- let rrd_cache = get_cache();
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
- }
-}
-
-/// Update RRD Gauge values
-pub fn update_value(
- rrd_cache: &Cache,
- name: &str,
- value: f64,
- timestamp: i64,
- datasource_type: DataSourceType,
-) {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
- }
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 27327a29..f65a9291 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -3,12 +3,12 @@ use std::sync::Arc;
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::{rrd::DataSourceType, Cache};
+use proxmox_rrd::rrd::DataSourceType;
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use super::rrd_cache;
+use super::rrd_cache::RrdCache;
/// Store request for the RRD task.
pub(super) enum RrdStoreRequest {
@@ -41,7 +41,7 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
- cache: Arc<Cache>,
+ cache: Arc<RrdCache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
@@ -95,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &RrdCache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -108,8 +108,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
ClusterMetricsDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -117,7 +116,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
);
}
-fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -130,8 +129,7 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
MetricDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -156,7 +154,7 @@ mod tests {
// Arrange
let dir = NamedTempDir::new()?;
let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
- let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+ let cache = Arc::new(RrdCache::new(dir.path(), options, options)?);
let (tx, rx) = tokio::sync::mpsc::channel(10);
let task = store_in_rrd_task(Arc::clone(&cache), rx);
@@ -215,8 +213,7 @@ mod tests {
// There is some race condition in proxmox_rrd, in some rare cases
// extract_data does not return any data directly after writing.
- if let Some(data) = rrd_cache::extract_data(
- &cache,
+ if let Some(data) = cache.extract_data(
"pve/some-remote/node/some-node",
"cpu_current",
RrdTimeframe::Hour,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f54ee72f..31e36c34 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -123,8 +123,7 @@ fn get_entity(
) -> Option<(f64, TopEntity)> {
let cache = rrd_cache::get_cache();
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &cache,
+ if let Ok(Some(values)) = cache.extract_data(
&name,
metric,
timeframe,
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 12/23] metric collection: record remote response time in metric database
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (10 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 11/23] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 13/23] metric collection: save time needed for collection run to RRD Lukas Wagner
` (13 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This gives us the ability to retrieve max/avg response times for a given
time window.
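For illustration, the recorded series can then be read back via the existing
RRD helpers, e.g. to get the maximum over the last hour (sketch only,
'pve-1' is just an example remote name):

    let entry = cache.extract_data(
        "local/metric-collection/remotes/pve-1",
        "response-time",
        RrdTimeframe::Hour,
        RrdMode::Max,
    )?;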
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 13 +++++++-
server/src/metric_collection/rrd_task.rs | 31 +++++++++++++++++++
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f2fe42a4..17fe1e2b 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -255,6 +255,7 @@ impl MetricCollectionTask {
let (result_tx, result_rx) = oneshot::channel();
let now = proxmox_time::epoch_i64();
+ let start = Instant::now();
let res: Result<RrdStoreResult, Error> = async {
match remote.ty {
@@ -268,11 +269,16 @@ impl MetricCollectionTask {
)
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pve {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
@@ -282,15 +288,20 @@ impl MetricCollectionTask {
.metrics(Some(true), Some(status.most_recent_datapoint))
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pbs {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
- }
+ };
result_rx.await.map_err(Error::from)
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index f65a9291..aa1197a0 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -20,6 +20,10 @@ pub(super) enum RrdStoreRequest {
metrics: ClusterMetrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
/// Store PBS metrics.
Pbs {
@@ -29,6 +33,10 @@ pub(super) enum RrdStoreRequest {
metrics: Metrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
}
@@ -54,11 +62,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pve(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -66,11 +77,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pbs(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -137,6 +151,12 @@ fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricData
);
}
+fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64, timestamp: i64) {
+ let name = format!("local/metric-collection/remotes/{remote_name}/response-time");
+
+ cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
@@ -199,6 +219,8 @@ mod tests {
remote: "some-remote".into(),
metrics,
channel: tx_back,
+ response_time: 10.0,
+ request_at: now,
};
// Act
@@ -224,6 +246,15 @@ mod tests {
assert!(data.data.iter().any(Option::is_some));
}
+ if let Some(data) = cache.extract_data(
+ "local/metric-collection/remotes/some-remote",
+ "response-time",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
Ok(())
}
}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 13/23] metric collection: save time needed for collection run to RRD
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (11 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 12/23] metric collection: record remote response time in metric database Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 14/23] metric collection: periodically clean removed remotes from statefile Lukas Wagner
` (12 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
For large setups, it might be useful to know how much time was needed to
collect metrics for *all* remotes together, e.g. for making sure that
the collection interval is not exceeded.
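The value ends up under "local/metric-collection/total-time" (see
store_stats() in the diff below), so one could for example check whether any
run in the last hour exceeded the interval; sketch only, assuming the same
basedir/name split as for the response-time series:

    if let Some(entry) = cache.extract_data(
        "local/metric-collection",
        "total-time",
        RrdTimeframe::Hour,
        RrdMode::Max,
    )? {
        // the stored value is in milliseconds, the interval in seconds
        let limit_ms = (DEFAULT_COLLECTION_INTERVAL * 1000) as f64;
        if entry.data.iter().flatten().any(|&ms| ms > limit_ms) {
            log::warn!("a collection run took longer than the collection interval");
        }
    }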
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 14 +++++
server/src/metric_collection/rrd_task.rs | 53 ++++++++++++++-----
2 files changed, 55 insertions(+), 12 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 17fe1e2b..4013c59f 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -17,6 +17,7 @@ use proxmox_sys::fs::CreateOptions;
use pdm_api_types::remotes::{Remote, RemoteType};
+use crate::metric_collection::rrd_task::CollectionStats;
use crate::{connection, task_utils};
use super::{
@@ -90,8 +91,21 @@ impl MetricCollectionTask {
log::debug!("starting metric collection from all remotes - triggered by timer");
if let Some(remotes) = Self::load_remote_config() {
+ let now = Instant::now();
let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
self.fetch_remotes(&remotes, &to_fetch).await;
+ let elapsed = now.elapsed();
+
+ if let Err(err) = self.metric_data_tx.send(
+ RrdStoreRequest::CollectionStats {
+ timestamp: proxmox_time::epoch_i64(),
+ stats: CollectionStats {
+ // TODO: use as_millis_f64 once stabilized
+ total_time: elapsed.as_secs_f64() * 1000.
+ }
+ }).await {
+ log::error!("could not send collection stats to rrd task: {err}");
+ }
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index aa1197a0..a69099c7 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -38,6 +38,13 @@ pub(super) enum RrdStoreRequest {
/// Timestamp at which the request was done (UNIX epoch).
request_at: i64,
},
+ /// Store collection stats.
+ CollectionStats {
+ /// Timestamp at which the collection took place (UNIX epoch).
+ timestamp: i64,
+ /// Statistics.
+ stats: CollectionStats,
+ },
}
/// Result for a [`RrdStoreRequest`].
@@ -46,6 +53,12 @@ pub(super) struct RrdStoreResult {
pub(super) most_recent_timestamp: i64,
}
+/// Statistics for a (full) metric collection run.
+pub(super) struct CollectionStats {
+ /// Total time in ms.
+ pub(super) total_time: f64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -57,7 +70,8 @@ pub(super) async fn store_in_rrd_task(
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
- let channel = match msg {
+
+ match msg {
RrdStoreRequest::Pve {
remote,
metrics,
@@ -71,7 +85,13 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
+
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
}
RrdStoreRequest::Pbs {
remote,
@@ -86,17 +106,17 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
- }
- };
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
- if channel
- .send(RrdStoreResult {
- most_recent_timestamp,
- })
- .is_err()
- {
- log::error!("could not send RrdStoreStoreResult to metric collection task");
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
+ }
+ RrdStoreRequest::CollectionStats { timestamp, stats } => {
+ store_stats(&cache_clone, &stats, timestamp)
+ }
};
})
.await;
@@ -157,6 +177,15 @@ fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64,
cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
}
+fn store_stats(cache: &RrdCache, stats: &CollectionStats, timestamp: i64) {
+ cache.update_value(
+ "local/metric-collection/total-time",
+ stats.total_time,
+ timestamp,
+ DataSourceType::Gauge,
+ );
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 14/23] metric collection: periodically clean removed remotes from statefile
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (12 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 13/23] metric collection: save time needed for collection run to RRD Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 15/23] api: add endpoint to trigger metric collection Lukas Wagner
` (11 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
Adding and removing remotes can leave leftover data in the statefile,
hence it makes sense to clean it up periodically.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 6 ++++
server/src/metric_collection/state.rs | 32 +++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 4013c59f..b960efff 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -91,6 +91,8 @@ impl MetricCollectionTask {
log::debug!("starting metric collection from all remotes - triggered by timer");
if let Some(remotes) = Self::load_remote_config() {
+ self.cleanup_removed_remotes_from_state(&remotes);
+
let now = Instant::now();
let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
self.fetch_remotes(&remotes, &to_fetch).await;
@@ -136,6 +138,10 @@ impl MetricCollectionTask {
}
}
+ fn cleanup_removed_remotes_from_state(&mut self, remotes: &SectionConfigData<Remote>) {
+ self.state.retain(|remote| remotes.get(remote).is_some());
+ }
+
/// Set up a [`tokio::time::Interval`] instance with the provided interval.
/// The timer will be aligned, e.g. an interval of `60` will let the timer
/// fire at minute boundaries.
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index e76444a9..7f68843e 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -74,6 +74,11 @@ impl MetricCollectionState {
Ok(())
}
+ /// Retain all remotes for which the predicate holds.
+ pub fn retain(&mut self, check: impl Fn(&str) -> bool) {
+ self.state.remote_status.retain(|remote, _| check(remote));
+ }
+
fn load_or_default(path: &Path) -> Result<State, Error> {
let content = proxmox_sys::fs::file_read_optional_string(path)?;
@@ -115,4 +120,31 @@ mod tests {
Ok(())
}
+
+ #[test]
+ fn test_retain() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options);
+
+ state.set_status(
+ "remote-1".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+ state.set_status(
+ "remote-2".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+
+ state.retain(|remote| remote == "remote-1");
+
+ assert!(state.get_status("remote-1").is_some());
+ assert!(state.get_status("remote-2").is_none());
+
+ Ok(())
+ }
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 15/23] api: add endpoint to trigger metric collection
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (13 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 14/23] metric collection: periodically clean removed remotes from statefile Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 16/23] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
` (10 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This one lives under /metric-collection/trigger.
Doing a POST will trigger metric collection for the remote provided in the
'remote' parameter, or for all remotes if the parameter is not provided.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/api/metric_collection.rs | 35 +++++++++++++++++++++++++++++
server/src/api/mod.rs | 2 ++
2 files changed, 37 insertions(+)
create mode 100644 server/src/api/metric_collection.rs
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
new file mode 100644
index 00000000..1a9e0e28
--- /dev/null
+++ b/server/src/api/metric_collection.rs
@@ -0,0 +1,35 @@
+use anyhow::Error;
+
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use proxmox_router::{Router, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([(
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+),]);
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ },
+ },
+)]
+/// Trigger metric collection for a provided remote or for all remotes if no remote is passed.
+pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ if let Some(remote) = remote {
+ crate::metric_collection::trigger_metric_collection_for_remote(remote).await?;
+ } else {
+ crate::metric_collection::trigger_metric_collection().await?;
+ }
+
+ Ok(())
+}
diff --git a/server/src/api/mod.rs b/server/src/api/mod.rs
index 6c4831b4..ff875fc8 100644
--- a/server/src/api/mod.rs
+++ b/server/src/api/mod.rs
@@ -9,6 +9,7 @@ use proxmox_sortable_macro::sortable;
pub mod access;
pub mod config;
+pub mod metric_collection;
pub mod nodes;
pub mod pbs;
pub mod pve;
@@ -21,6 +22,7 @@ mod rrd_common;
const SUBDIRS: SubdirMap = &sorted!([
("access", &access::ROUTER),
("config", &config::ROUTER),
+ ("metric-collection", &metric_collection::ROUTER),
("ping", &Router::new().get(&API_METHOD_PING)),
("pve", &pve::ROUTER),
("pbs", &pbs::ROUTER),
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 16/23] api: remotes: trigger immediate metric collection for newly added nodes
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (14 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 15/23] api: add endpoint to trigger metric collection Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 17/23] api: add api for querying metric collection RRD data Lukas Wagner
` (9 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This makes sure that metric graphs are available right after adding a
new remote.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
server/src/api/remotes.rs | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 8ff1ada0..27b91cfe 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -18,6 +18,7 @@ use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
+use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
@@ -171,10 +172,15 @@ pub async fn add_remote(mut entry: Remote, create_token: Option<String>) -> Resu
entry.token = token;
}
+ let name = entry.id.clone();
remotes.insert(entry.id.to_owned(), entry);
pdm_config::remotes::save_config(&remotes)?;
+ if let Err(e) = metric_collection::trigger_metric_collection_for_remote(name).await {
+ log::error!("could not trigger metric collection after adding remote: {e}");
+ }
+
Ok(())
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 17/23] api: add api for querying metric collection RRD data
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (15 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 16/23] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 18/23] api: metric-collection: add status endpoint Lukas Wagner
` (8 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This commit adds two new API endpoints:
- remotes/{id}/metric-collection-rrddata
- metric-collection/rrddata
The first one returns graphable datapoints for the API response time
when fetching metrics, the second one returns datapoints for the total
time needed to collect metrics from *all* remotes.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
lib/pdm-api-types/src/rrddata.rs | 26 +++++++++++++
server/src/api/metric_collection.rs | 60 ++++++++++++++++++++++++++---
server/src/api/remotes.rs | 53 +++++++++++++++++++++++++
3 files changed, 134 insertions(+), 5 deletions(-)
diff --git a/lib/pdm-api-types/src/rrddata.rs b/lib/pdm-api-types/src/rrddata.rs
index a973977c..7559243c 100644
--- a/lib/pdm-api-types/src/rrddata.rs
+++ b/lib/pdm-api-types/src/rrddata.rs
@@ -216,3 +216,29 @@ pub struct PbsDatastoreDataPoint {
#[serde(skip_serializing_if = "Option::is_none")]
pub disk_write: Option<f64>,
}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for statistics about the metric collection loop.
+pub struct FullCollectionDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// Total time in milliseconds needed for full metric collection run.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub total_time: Option<f64>,
+}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for metric collection from a single remote.
+pub struct RemoteCollectionDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// API response time in milliseconds when requesting the metrics from the remote.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub response_time: Option<f64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 1a9e0e28..22736cb0 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -1,17 +1,27 @@
use anyhow::Error;
-use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
use proxmox_router::{Router, SubdirMap};
+use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
+use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+
+use super::rrd_common::{self, DataPoint};
+
pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
#[sortable]
-const SUBDIRS: SubdirMap = &sorted!([(
- "trigger",
- &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
-),]);
+const SUBDIRS: SubdirMap = &sorted!([
+ (
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+ ),
+ (
+ "rrddata",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
+ ),
+]);
#[api(
input: {
@@ -33,3 +43,43 @@ pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Err
Ok(())
}
+
+impl DataPoint for FullCollectionDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["total-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "total-time" {
+ self.total_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read metric collection RRD data.
+fn get_metric_collection_rrd_data(
+ timeframe: RrdTimeframe,
+ cf: RrdMode,
+) -> Result<Vec<FullCollectionDatapoint>, Error> {
+ let base = "local/metric-collection";
+ rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
+}
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 27b91cfe..bbb24b26 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -9,6 +9,8 @@ use proxmox_access_control::CachedUserInfo;
use proxmox_router::{
http_bail, http_err, list_subdirs_api_method, Permission, Router, RpcEnvironment, SubdirMap,
};
+use proxmox_rrd_api_types::RrdMode;
+use proxmox_rrd_api_types::RrdTimeframe;
use proxmox_schema::api;
use proxmox_schema::Schema;
use proxmox_section_config::typed::SectionConfigData;
@@ -16,12 +18,15 @@ use proxmox_sortable_macro::sortable;
use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
+use pdm_api_types::rrddata::RemoteCollectionDatapoint;
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
+use super::rrd_common;
+use super::rrd_common::DataPoint;
pub const ROUTER: Router = Router::new()
.get(&API_METHOD_LIST_REMOTES)
@@ -38,6 +43,10 @@ const ITEM_ROUTER: Router = Router::new()
const SUBDIRS: SubdirMap = &sorted!([
("config", &Router::new().get(&API_METHOD_REMOTE_CONFIG)),
("version", &Router::new().get(&API_METHOD_VERSION)),
+ (
+ "metric-collection-rrddata",
+ &Router::new().get(&API_METHOD_GET_PER_REMOTE_METRIC_COLLECTION_RRD_DATA)
+ ),
]);
pub fn get_remote<'a>(
@@ -331,3 +340,47 @@ pub fn remote_config(id: String) -> Result<Remote, Error> {
remote.token = String::new(); // mask token in response
Ok(remote.clone())
}
+
+impl DataPoint for RemoteCollectionDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["response-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "response-time" {
+ self.response_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ id: {
+ schema: REMOTE_ID_SCHEMA,
+ },
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read metric collection RRD data.
+fn get_per_remote_metric_collection_rrd_data(
+ id: String,
+ timeframe: RrdTimeframe,
+ cf: RrdMode,
+) -> Result<Vec<RemoteCollectionDatapoint>, Error> {
+ let base = format!("local/metric-collection/remotes/{id}");
+ rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 18/23] api: metric-collection: add status endpoint
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (16 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 17/23] api: add api for querying metric collection RRD data Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 19/23] pdm-client: add metric collection API methods Lukas Wagner
` (7 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This adds a new API endpoint at /metric-collection/status, returning the
status of the last metric collection attempt from each remote.
For now, this contains the timestamp of the last successful collection
(last-collection) and any error that occurred (error).
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
Changes since v2:
- use `_` for type casts
- Drop max-concurrency, and delay settings for now
lib/pdm-api-types/src/lib.rs | 3 +++
lib/pdm-api-types/src/metric_collection.rs | 20 ++++++++++++++++++++
server/src/api/metric_collection.rs | 16 +++++++++++++++-
server/src/metric_collection/mod.rs | 22 ++++++++++++++++++++++
4 files changed, 60 insertions(+), 1 deletion(-)
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
index 9373725c..0a438188 100644
--- a/lib/pdm-api-types/src/lib.rs
+++ b/lib/pdm-api-types/src/lib.rs
@@ -19,6 +19,9 @@ pub use acl::*;
mod node_config;
pub use node_config::*;
+mod metric_collection;
+pub use metric_collection::*;
+
mod proxy;
pub use proxy::HTTP_PROXY_SCHEMA;
diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
new file mode 100644
index 00000000..5279c8a4
--- /dev/null
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -0,0 +1,20 @@
+//! API types for metric collection.
+
+use serde::{Deserialize, Serialize};
+
+use proxmox_schema::api;
+
+#[api]
+#[derive(Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Per-remote collection status.
+pub struct MetricCollectionStatus {
+ /// The remote's name.
+ pub remote: String,
+ /// Any error that occurred during the last collection attempt.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub error: Option<String>,
+ /// Timestamp of last successful collection.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub last_collection: Option<i64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 22736cb0..a8243296 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -5,7 +5,11 @@ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
-use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint};
+use pdm_api_types::{
+ remotes::REMOTE_ID_SCHEMA, rrddata::FullCollectionDatapoint, MetricCollectionStatus,
+};
+
+use crate::metric_collection;
use super::rrd_common::{self, DataPoint};
@@ -21,6 +25,10 @@ const SUBDIRS: SubdirMap = &sorted!([
"rrddata",
&Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_RRD_DATA)
),
+ (
+ "status",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
+ ),
]);
#[api(
@@ -83,3 +91,9 @@ fn get_metric_collection_rrd_data(
let base = "local/metric-collection";
rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
}
+
+#[api]
+/// Read metric collection status.
+fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ metric_collection::get_status()
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 5b6c14d2..9f832ded 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,8 @@ use tokio::sync::mpsc::{self, Sender};
use proxmox_sys::fs::CreateOptions;
+use pdm_api_types::MetricCollectionStatus;
+
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
@@ -87,3 +89,23 @@ pub async fn trigger_metric_collection() -> Result<(), Error> {
Ok(())
}
+
+/// Get each remote's metric collection status.
+pub fn get_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ let (remotes, _) = pdm_config::remotes::config()?;
+ let state = collection_task::load_state()?;
+
+ let mut result = Vec::new();
+
+ for (remote, _) in remotes.into_iter() {
+ if let Some(status) = state.get_status(&remote) {
+ result.push(MetricCollectionStatus {
+ remote,
+ error: status.error.clone(),
+ last_collection: status.last_collection,
+ })
+ }
+ }
+
+ Ok(result)
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 19/23] pdm-client: add metric collection API methods
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (17 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 18/23] api: metric-collection: add status endpoint Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 20/23] cli: add commands for metric-collection trigger and status Lukas Wagner
` (6 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This adds client bindings for the metric collection trigger, status, and
RRD data APIs.
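For illustration, a rough usage sketch (not part of this patch; it assumes
an already connected `PdmClient` named `client`, the crate path
`pdm_client`, and a made-up remote name):

    // Hedged sketch: errors are only printed, not propagated, to avoid
    // assuming specific From conversions between error types.
    async fn trigger_and_report<T: proxmox_client::HttpApiClient>(
        client: &pdm_client::PdmClient<T>,
    ) {
        // Trigger collection for one remote; `None` would collect all remotes.
        if let Err(err) = client.trigger_metric_collection(Some("pve-remote-1")).await {
            eprintln!("triggering metric collection failed: {err}");
            return;
        }

        // Print the per-remote status afterwards.
        match client.get_metric_collection_status().await {
            Ok(list) => {
                for status in list {
                    match status.error {
                        Some(err) => println!("{}: {err}", status.remote),
                        None => println!("{}: ok", status.remote),
                    }
                }
            }
            Err(err) => eprintln!("querying metric collection status failed: {err}"),
        }
    }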
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
lib/pdm-client/src/lib.rs | 58 +++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/lib/pdm-client/src/lib.rs b/lib/pdm-client/src/lib.rs
index 61a8ebd9..86010d67 100644
--- a/lib/pdm-client/src/lib.rs
+++ b/lib/pdm-client/src/lib.rs
@@ -317,6 +317,64 @@ impl<T: HttpApiClient> PdmClient<T> {
.ok_or_else(|| Error::BadApi("api returned no webauthn entry id".to_string(), None))
}
+ /// Trigger metric collection for a single remote or for all remotes, if no remote is provided.
+ pub async fn trigger_metric_collection(
+ &self,
+ remote: Option<&str>,
+ ) -> Result<(), proxmox_client::Error> {
+ let path = "/api2/extjs/metric-collection/trigger";
+
+ #[derive(Serialize)]
+ struct TriggerParams<'a> {
+ #[serde(skip_serializing_if = "Option::is_none")]
+ remote: Option<&'a str>,
+ }
+
+ self.0
+ .post(path, &TriggerParams { remote })
+ .await?
+ .nodata()?;
+
+ Ok(())
+ }
+
+ /// Get global metric collection status.
+ pub async fn get_metric_collection_status(
+ &self,
+ ) -> Result<Vec<pdm_api_types::MetricCollectionStatus>, Error> {
+ let path = "/api2/extjs/metric-collection/status";
+ Ok(self.0.get(path).await?.expect_json()?.data)
+ }
+
+ /// Get global metric collection RRD data.
+ pub async fn get_metric_collection_rrddata(
+ &self,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<pdm_api_types::rrddata::FullCollectionDatapoint, Error> {
+ let path = ApiPathBuilder::new("/api2/extjs/metric-collection/rrddata")
+ .arg("cf", mode)
+ .arg("timeframe", timeframe)
+ .build();
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
+ /// Get per-remote metric collection RRD data.
+ pub async fn get_per_remote_metric_collection_rrddata(
+ &self,
+ remote: &str,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<pdm_api_types::rrddata::RemoteCollectionDatapoint, Error> {
+ let path = ApiPathBuilder::new(format!(
+ "/api2/extjs/remotes/{remote}/metric-collection-rrddata"
+ ))
+ .arg("cf", mode)
+ .arg("timeframe", timeframe)
+ .build();
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
pub async fn pve_list_nodes(
&self,
remote: &str,
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 20/23] cli: add commands for metric-collection trigger and status
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (18 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 19/23] pdm-client: add metric collection API methods Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 21/23] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
` (5 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This adds new commands to the proxmox-datacenter-client CLI tool, namely
- trigger metric collection
- show status of the last metric collection
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
Changes since v1:
- update-settings -> 'settings update'
- show-settings -> 'settings show'
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 70 +++++++++++++++++++++++++++++
3 files changed, 73 insertions(+)
create mode 100644 cli/client/src/metric_collection.rs
diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
index 3de3bf3f..c2aac874 100644
--- a/cli/client/Cargo.toml
+++ b/cli/client/Cargo.toml
@@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
proxmox-schema = { workspace = true, features = [ "api-macro" ] }
proxmox-section-config.workspace = true
proxmox-sys.workspace = true
+proxmox-time.workspace = true
proxmox-tfa = { workspace = true, features = [ "types" ] }
# for the pve API types:
diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
index b1cc2ad5..27fd3744 100644
--- a/cli/client/src/main.rs
+++ b/cli/client/src/main.rs
@@ -17,6 +17,7 @@ pub mod env;
pub mod acl;
pub mod config;
+pub mod metric_collection;
pub mod pbs;
pub mod pve;
pub mod remotes;
@@ -96,6 +97,7 @@ fn main_do() -> Result<(), Error> {
)
.insert("acl", acl::cli())
.insert("login", CliCommand::new(&API_METHOD_LOGIN))
+ .insert("metric-collection", metric_collection::cli())
.insert("pbs", pbs::cli())
.insert("pve", pve::cli())
.insert("remote", remotes::cli())
diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
new file mode 100644
index 00000000..e9dbd804
--- /dev/null
+++ b/cli/client/src/metric_collection.rs
@@ -0,0 +1,70 @@
+use anyhow::Error;
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use proxmox_router::cli::{
+ format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
+};
+use proxmox_schema::api;
+
+use crate::{client, env};
+
+pub fn cli() -> CommandLineInterface {
+ CliCommandMap::new()
+ .insert(
+ "trigger",
+ CliCommand::new(&API_METHOD_TRIGGER_METRIC_COLLECTION),
+ )
+ .insert(
+ "status",
+ CliCommand::new(&API_METHOD_METRIC_COLLECTION_STATUS),
+ )
+ .into()
+}
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ }
+ }
+)]
+/// Trigger metric collection. If a remote is passed, only this remote will be collected, otherwise
+/// all.
+async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ client()?
+ .trigger_metric_collection(remote.as_deref())
+ .await?;
+ Ok(())
+}
+
+#[api]
+/// Show metric collection status.
+async fn metric_collection_status() -> Result<(), Error> {
+ let result = client()?.get_metric_collection_status().await?;
+
+ let output_format = env().format_args.output_format;
+ if output_format == OutputFormat::Text {
+ for remote_status in result {
+ let timestamp = if let Some(last_collection) = remote_status.last_collection {
+ proxmox_time::strftime_local("%a, %d %b %Y %T %z", last_collection)?
+ } else {
+ "never".into()
+ };
+
+ let status = if let Some(err) = &remote_status.error {
+ err
+ } else {
+ "ok"
+ };
+
+ println!("{}: {status}", remote_status.remote);
+ println!(" last successful: {timestamp}");
+ println!();
+ }
+ } else {
+ format_and_print_result(&result, &output_format.to_string());
+ }
+ Ok(())
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 21/23] metric collection: factor out handle_tick and handle_control_message fns
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (19 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 20/23] cli: add commands for metric-collection trigger and status Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 22/23] metric collection: skip missed timer ticks Lukas Wagner
` (4 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
No functional changes, just moving some code into separate methods for
better readability.
Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
New in v2.
.../src/metric_collection/collection_task.rs | 93 +++++++++++--------
1 file changed, 54 insertions(+), 39 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index b960efff..db65af24 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -88,47 +88,11 @@ impl MetricCollectionTask {
loop {
tokio::select! {
_ = timer.tick() => {
- log::debug!("starting metric collection from all remotes - triggered by timer");
-
- if let Some(remotes) = Self::load_remote_config() {
- self.cleanup_removed_remotes_from_state(&remotes);
-
- let now = Instant::now();
- let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
- self.fetch_remotes(&remotes, &to_fetch).await;
- let elapsed = now.elapsed();
-
- if let Err(err) = self.metric_data_tx.send(
- RrdStoreRequest::CollectionStats {
- timestamp: proxmox_time::epoch_i64(),
- stats: CollectionStats {
- // TODO: use as_millis_f64 once stabilized
- total_time: elapsed.as_secs_f64() * 1000.
- }
- }).await {
- log::error!("could not send collection stats to rrd task: {err}");
- }
- }
+ self.handle_tick().await;
}
- val = self.control_message_rx.recv() => {
- // Reload settings in case they have changed in the meanwhile
- match val {
- Some(ControlMsg::CollectSingleRemote(remote)) => {
- if let Some(remotes) = Self::load_remote_config() {
- log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
- self.fetch_remotes(&remotes, &[remote]).await;
- }
- }
- Some(ControlMsg::CollectAllRemotes) => {
- if let Some(remotes) = Self::load_remote_config() {
- log::debug!("starting metric collection from all remotes - triggered by control message");
- let to_fetch = remotes.iter().map(|(name, _)| name.into()).collect::<Vec<String>>();
- self.fetch_remotes(&remotes, &to_fetch).await;
- }
- }
- _ => {},
- }
+ Some(message) = self.control_message_rx.recv() => {
+ self.handle_control_message(message).await;
}
}
@@ -138,6 +102,57 @@ impl MetricCollectionTask {
}
}
+ /// Handle a timer tick.
+ async fn handle_tick(&mut self) {
+ log::debug!("starting metric collection from all remotes - triggered by timer");
+
+ if let Some(remotes) = Self::load_remote_config() {
+ self.cleanup_removed_remotes_from_state(&remotes);
+
+ let now = Instant::now();
+ let to_fetch = remotes
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ let elapsed = now.elapsed();
+
+ if let Err(err) = self
+ .metric_data_tx
+ .send(RrdStoreRequest::CollectionStats {
+ timestamp: proxmox_time::epoch_i64(),
+ stats: CollectionStats {
+ // TODO: use as_millis_f64 once stabilized
+ total_time: elapsed.as_secs_f64() * 1000.,
+ },
+ })
+ .await
+ {
+ log::error!("could not send collection stats to rrd task: {err}");
+ }
+ }
+ }
+
+ /// Handle a control message for force-triggered collection.
+ async fn handle_control_message(&mut self, message: ControlMsg) {
+ if let Some(remotes) = Self::load_remote_config() {
+ match message {
+ ControlMsg::CollectSingleRemote(remote) => {
+ log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
+ self.fetch_remotes(&remotes, &[remote]).await;
+ }
+ ControlMsg::CollectAllRemotes => {
+ log::debug!("starting metric collection from all remotes - triggered by control message");
+ let to_fetch = remotes
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ }
+ }
+ }
+ }
+
fn cleanup_removed_remotes_from_state(&mut self, remotes: &SectionConfigData<Remote>) {
self.state.retain(|remote| remotes.get(remote).is_some());
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 22/23] metric collection: skip missed timer ticks
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (20 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 21/23] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 23/23] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
` (3 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
The default behavior of `tokio::time::Interval` when a tick is missed
is 'burst', which is to shorten the timer intervals until the original
alignment is restored, while keeping the *number of ticks* the same as
expected.
Example from the official tokio docs [1] for burst mode:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work|work|work-|work-----|
For metric collection, we do not really gain anything from bursting
missed ticks. For us, 'skip' is fine. There, the alignment is
immediately restored by skipping any ticks that have been missed:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work---| work----| work----|
[1] https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html
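For illustration (not part of this patch), a minimal standalone sketch of
the skip behavior; it only assumes a current tokio with the "rt", "time"
and "macros" features:

    use std::time::Duration;
    use tokio::time::{self, Instant, MissedTickBehavior};

    #[tokio::main]
    async fn main() {
        let start = Instant::now();
        let mut timer = time::interval(Duration::from_millis(100));
        // Drop ticks that were missed while a "collection run" overran the
        // interval and realign to the original schedule instead of bursting.
        timer.set_missed_tick_behavior(MissedTickBehavior::Skip);

        for i in 0..5 {
            timer.tick().await;
            println!("tick {i} at {:?}", start.elapsed());
            if i == 0 {
                // Simulate one overly long collection run (misses two ticks).
                time::sleep(Duration::from_millis(250)).await;
            }
        }
    }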
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
New in v2.
server/src/metric_collection/collection_task.rs | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index db65af24..59493162 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -9,7 +9,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
- time::Interval,
+ time::{Interval, MissedTickBehavior},
};
use proxmox_section_config::typed::SectionConfigData;
@@ -166,6 +166,13 @@ impl MetricCollectionTask {
fn setup_timer(interval: u64) -> (Interval, Instant) {
log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
+
+ // If we miss a tick because a previous collection run took too long, we want to
+ // tick as soon as possible, but we do not need to repeat missing ticks.
+ // We do want to stay aligned, though.
+ // https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html#variant.Skip
+ timer.set_missed_tick_behavior(MissedTickBehavior::Skip);
+
let first_run = task_utils::next_aligned_instant(interval);
timer.reset_at(first_run.into());
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v6 23/23] metric collection: use JoinSet instead of joining from handles in a Vec
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (21 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 22/23] metric collection: skip missed timer ticks Lukas Wagner
@ 2025-08-21 9:53 ` Lukas Wagner
2025-08-21 12:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Dominik Csapak
` (2 subsequent siblings)
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 9:53 UTC (permalink / raw)
To: pdm-devel
This lets us process finished tasks in the order they finish, not in the
order they were spawned.
Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
Notes:
New in v2.
.../src/metric_collection/collection_task.rs | 25 ++++++++-----------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 59493162..da6cc850 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -9,6 +9,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
+ task::JoinSet,
time::{Interval, MissedTickBehavior},
};
@@ -201,7 +202,8 @@ impl MetricCollectionTask {
remotes_to_fetch: &[String],
) {
let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
- let mut handles = Vec::new();
+ let mut handles = JoinSet::new();
+
let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
@@ -224,29 +226,22 @@ impl MetricCollectionTask {
if let Some(remote) = remote_config.get(remote_name).cloned() {
log::debug!("fetching remote '{}'", remote.id);
- let handle = tokio::spawn(Self::fetch_single_remote(
+ handles.spawn(Self::fetch_single_remote(
remote,
status,
self.metric_data_tx.clone(),
permit,
));
-
- handles.push((remote_name.clone(), handle));
}
}
- for (remote_name, handle) in handles {
- let res = handle.await;
-
+ while let Some(res) = handles.join_next().await {
match res {
- Ok(Ok(ts)) => {
- self.state.set_status(remote_name, ts);
+ Ok((name, status)) => {
+ self.state.set_status(name, status);
}
- Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
- log::error!(
- "join error for metric collection task for remote {remote_name}: {err}"
- )
+ log::error!("join error for metric collection task for remote: {err}")
}
}
}
@@ -293,7 +288,7 @@ impl MetricCollectionTask {
mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<RemoteStatus, Error> {
+ ) -> (String, RemoteStatus) {
let (result_tx, result_rx) = oneshot::channel();
let now = proxmox_time::epoch_i64();
@@ -361,7 +356,7 @@ impl MetricCollectionTask {
}
}
- Ok(status)
+ (remote.id, status)
}
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (22 preceding siblings ...)
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 23/23] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
@ 2025-08-21 12:53 ` Dominik Csapak
2025-08-21 13:46 ` Lukas Wagner
2025-08-27 7:19 ` Thomas Lamprecht
2025-08-22 11:51 ` Dominik Csapak
2025-08-26 13:53 ` [pdm-devel] superseded: " Lukas Wagner
25 siblings, 2 replies; 32+ messages in thread
From: Dominik Csapak @ 2025-08-21 12:53 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion, Lukas Wagner
just a first high level question while i'm testing/reviewing this:
when using this patch, it seems the collection interval is much reduced?
e.g. in the gui I'm missing up to the last 10 minutes now?
(at 14:49 the last point i have for the rrd is from 14:40)
is this by design? i get that we don't want to pull too often,
but showing up to 10minutes out of date graphs is also not practical?
my naive solution would be to proxy the rrd requests to the
pve nodes directly, but then why would we need the metric
collection in the first place?
i guess i'm just missing something here, maybe you can point it out for me?
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 12:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Dominik Csapak
@ 2025-08-21 13:46 ` Lukas Wagner
2025-08-22 11:27 ` Dominik Csapak
2025-08-27 7:19 ` Thomas Lamprecht
1 sibling, 1 reply; 32+ messages in thread
From: Lukas Wagner @ 2025-08-21 13:46 UTC (permalink / raw)
To: Dominik Csapak, Proxmox Datacenter Manager development discussion
On Thu Aug 21, 2025 at 2:53 PM CEST, Dominik Csapak wrote:
> just a first high level question while i'm testing/reviewing this:
>
> when using this patch, it seems the collection interval is much reduced?
>
> e.g. in the gui I'm missing up to the last 10 minutes now?
> (at 14:49 the last point i have for the rrd is from 14:40)
>
> is this by design? i get that we don't want to pull too often,
> but showing up to 10minutes out of date graphs is also not practical?
Maybe I misremember, but I vaguely recall Thomas and Dietmar
independently mentioning that the default metric polling interval should
be higher than what is implemented right now (1min).
Anyway, at some point the interval should be configurable (already was
in earlier versions of this series, but I dropped these patches for now
since it's not 100% clear yet to me *where* we want to configure these
things/store the settings). So then the question is what *default* to
use - for now I settled for 10mins. But I'm open for better values, I
don't have hard feelings about this.
>
> my naive solution would be to proxy the rrd requests to the
> pve nodes directly, but then why would we need the metric
> collection in the first place?
We also could trigger an out-of-schedule metric collection for a remote
when the RRD graph calls the rrddata endpoint (the functions for
triggering the collection of a single remote are already there, albeit
non-blocking, so this would need some changes). Fetching the missing
data for a single remote should be fast enough any way. The rrddata
endpoint could have some timeout for waiting for the results of
collecting that single remote; if that one is exceeded we don't wait
until the collection results are done but simply return the existing
data.
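A very rough sketch of what I have in mind for the rrddata handler, just to
illustrate the idea (the `..._and_wait` helper is hypothetical and does not
exist yet, and the timeout value is arbitrary):

    // Hypothetical sketch in the context of server/src/api/remotes.rs:
    // wait briefly for an on-demand collection of the single remote and
    // fall back to the existing RRD data if it takes too long.
    async fn rrddata_with_refresh(
        id: String,
        timeframe: RrdTimeframe,
        cf: RrdMode,
    ) -> Result<Vec<RemoteCollectionDatapoint>, Error> {
        // Hypothetical blocking variant of trigger_metric_collection_for_remote()
        // that resolves once the remote has actually been collected.
        let refresh =
            metric_collection::trigger_metric_collection_for_remote_and_wait(id.clone());

        match tokio::time::timeout(std::time::Duration::from_secs(5), refresh).await {
            Ok(Err(err)) => log::warn!("on-demand collection for '{id}' failed: {err}"),
            Err(_) => {
                log::warn!("on-demand collection for '{id}' timed out, returning existing data")
            }
            Ok(Ok(())) => {}
        }

        // Same as the regular per-remote rrddata endpoint.
        let base = format!("local/metric-collection/remotes/{id}");
        rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
    }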
>
> i guess i'm just missing something here, maybe you can point it out for me?
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 13:46 ` Lukas Wagner
@ 2025-08-22 11:27 ` Dominik Csapak
0 siblings, 0 replies; 32+ messages in thread
From: Dominik Csapak @ 2025-08-22 11:27 UTC (permalink / raw)
To: Lukas Wagner, Proxmox Datacenter Manager development discussion
On 8/21/25 3:46 PM, Lukas Wagner wrote:
> On Thu Aug 21, 2025 at 2:53 PM CEST, Dominik Csapak wrote:
>> just a first high level question while i'm testing/reviewing this:
>>
>> when using this patch, it seems the collection interval is much reduced?
>>
>> e.g. in the gui I'm missing up to the last 10 minutes now?
>> (at 14:49 the last point i have for the rrd is from 14:40)
>>
>> is this by design? i get that we don't want to pull too often,
>> but showing up to 10minutes out of date graphs is also not practical?
>
> Maybe I misremember, but I vaguely recall Thomas and Dietmar
> independently mentioning that the default metric polling interval should
> be higher than what is implemented right now (1min).
>
> Anyway, at some point the interval should be configurable (already was
> in earlier versions of this series, but I dropped these patches for now
> since it's not 100% clear yet to me *where* we want to configure these
> things/store the settings). So then the question is what *default* to
> use - for now I settled for 10mins. But I'm open for better values, I
> don't have hard feelings about this.
>
>>
>> my naive solution would be to proxy the rrd requests to the
>> pve nodes directly, but then why would we need the metric
>> collection in the first place?
>
> We also could trigger an out-of-schedule metric collection for a remote
> when the RRD graph calls the rrddata endpoint (the functions for
> triggering the collection of a single remote are already there, albeit
> non-blocking, so this would need some changes). Fetching the missing
> data for a single remote should be fast enough any way. The rrddata
> endpoint could have some timeout for waiting for the results of
> collecting that single remote; if that one is exceeded we don't wait
> until the collection results are done but simply return the existing
> data.
>
The problem is actually just the 'hourly' graphs, since there the
interval is so small that one notices the 10 minute gap quite often.
All others (daily,monthly, etc.) don't exhibit the same problem
so i'd say if we can fetch missing data on-the-fly in the api call
for hourly calls, it would be good enough
>>
>> i guess i'm just missing something here, maybe you can point it out for me?
>
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (23 preceding siblings ...)
2025-08-21 12:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Dominik Csapak
@ 2025-08-22 11:51 ` Dominik Csapak
2025-08-22 12:49 ` Dominik Csapak
2025-08-25 8:43 ` Lukas Wagner
2025-08-26 13:53 ` [pdm-devel] superseded: " Lukas Wagner
25 siblings, 2 replies; 32+ messages in thread
From: Dominik Csapak @ 2025-08-22 11:51 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion, Lukas Wagner
aside from the issue with the last 10 minute gap of data in the hourly
view, code LGTM and tested fine, did not notice anything else off
Some patches (especially the last 3) could possibly be squashed into
earlier patches (no biggie though)
Since this is a rather large series, I'd like to see a reviewed-by from
someone else before applying, since the reviewed-by from maximiliano
came in at v2 i think, and there were some noticeable changes since
then.
With that said, consider this round
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-22 11:51 ` Dominik Csapak
@ 2025-08-22 12:49 ` Dominik Csapak
2025-08-25 8:43 ` Lukas Wagner
1 sibling, 0 replies; 32+ messages in thread
From: Dominik Csapak @ 2025-08-22 12:49 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion, Lukas Wagner
ah one thing i forgot to mention:
i'd probably favor doing it the way you described here:
> - Should `GET /remotes/<remote>/metric-collection-rrddata` be
> just `rrddata`?
> not sure if we are going to add any other PDM-native per-remote
> metrics and whether we want to return that from the same API call
> as this...
i think it's fine to mix it here
also the /metric-collection/rrddata could be merged
with a (not yet implemented) rrddata endpoint in the /nodes/localhost path
i guess we want to expose some pdm host rrddata sooner or later anyway
there we can include this too?
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-22 11:51 ` Dominik Csapak
2025-08-22 12:49 ` Dominik Csapak
@ 2025-08-25 8:43 ` Lukas Wagner
1 sibling, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-25 8:43 UTC (permalink / raw)
To: Dominik Csapak, Proxmox Datacenter Manager development discussion
On Fri Aug 22, 2025 at 1:51 PM CEST, Dominik Csapak wrote:
> aside from the issue with the last 10 minute gap of data in the hourly
> view, code LGTM and tested fine, did not notice anything else off
>
> Some patches (especially the last 3) could possibly be squashed into
> earlier patches (no biggie though)
Will do, if I can do it without large conflicts while rebasing.
>
> Since this is a rather large series, I'd like to see a reviewed-by from
> someone else before applying, since the reviewed-by from maximiliano
> came in at v2 i think, and there were some noticeable changes since
> then.
>
> With that said, consider this round
>
> Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
> Tested-by: Dominik Csapak <d.csapak@proxmox.com>
[copied from your other response]:
On Fri Aug 22, 2025 at 2:49 PM CEST, Dominik Csapak wrote:
> ah one thing i forgot to mention:
>
> i'd probably favor doing it the way you described here:
>
>
> > - Should `GET /remotes/<remote>/metric-collection-rrddata` be
> > just `rrddata`?
> > not sure if we are going to add any other PDM-native per-remote
> > metrics and whether we want to return that from the same API call
> > as this...
>
> i think it's fine to mix it here
Alright, I'll rename it to just `rrddata` then.
>
> also the /metric-collection/rrddata could be merged
> with a (not yet implemented) rrddata endpoint in the /nodes/localhost path
Will do!
>
> i guess we want to expose some pdm host rrddata sooner or later anyway
> there we can include this too?
[copied from your other response]:
On Fri Aug 22, 2025 at 1:27 PM CEST, Dominik Csapak wrote:
>> We also could trigger an out-of-schedule metric collection for a remote
>> when the RRD graph calls the rrddata endpoint (the functions for
>> triggering the collection of a single remote are already there, albeit
>> non-blocking, so this would need some changes). Fetching the missing
>> data for a single remote should be fast enough any way. The rrddata
>> endpoint could have some timeout for waiting for the results of
>> collecting that single remote; if that one is exceeded we don't wait
>> until the collection results are done but simply return the existing
>> data.
>>
>
> The problem is actually just the 'hourly' graphs, since there the
> interval is so small that one notices the 10 minute gap quite often.
>
> All others (daily,monthly, etc.) don't exhibit the same problem
>
> so i'd say if we can fetch missing data on-the-fly in the api call
> for hourly calls, it would be good enough
Alright, will try to implement something akin to what I've described for
v7.
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* [pdm-devel] superseded: [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (24 preceding siblings ...)
2025-08-22 11:51 ` Dominik Csapak
@ 2025-08-26 13:53 ` Lukas Wagner
25 siblings, 0 replies; 32+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:53 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion
superseded by:
https://lore.proxmox.com/pdm-devel/20250826135119.336510-1-l.wagner@proxmox.com/T/#t
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI)
2025-08-21 12:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Dominik Csapak
2025-08-21 13:46 ` Lukas Wagner
@ 2025-08-27 7:19 ` Thomas Lamprecht
1 sibling, 0 replies; 32+ messages in thread
From: Thomas Lamprecht @ 2025-08-27 7:19 UTC (permalink / raw)
To: Proxmox Datacenter Manager development discussion,
Dominik Csapak, Lukas Wagner
On 21/08/2025 14:54, Dominik Csapak wrote:
> just a first high level question while i'm testing/reviewing this:
>
> when using this patch, it seems the collection interval is much reduced?
>
> e.g. in the gui I'm missing up to the last 10 minutes now?
> (at 14:49 the last point i have for the rrd is from 14:40)
>
> is this by design? i get that we don't want to pull too often,
> but showing up to 10minutes out of date graphs is also not practical?
Yes, this was by design, or well, indeed mine and also Dietmar's request.
But surely something that might warrant a central knob with a per
remote override in the future (or smarter heuristics depending on
remote node count and latency, i.e. how long the last few metric
queries took on average).
But the idea was also to fetch missing data on demand if required, e.g.
when a user visits a specific resource with a time frame length where
this is noticeable; like you mentioned, it mostly sticks out for
the last-hour time frame.
> my naive solution would be to proxy the rrd requests to the
> pve nodes directly, but then why would we need the metric
> collection in the first place?
Those two are a bit orthogonal to each other: we want RRD collection
to not depend on every remote being online all the time (to see
when it went offline, to avoid high latency adding many delays, ...), and
being able to do quick metrics and stats computations on all resources
(which will never be 100% real time anyway). The other is seeing most
recent data, e.g. when investigating why something is slow or what not.
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
Thread overview: 32+ messages
2025-08-21 9:52 [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 01/23] metric collection: split top_entities split into separate module Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 02/23] metric collection: save metric data to RRD in separate task Lukas Wagner
2025-08-21 9:52 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 03/23] metric collection: rework metric poll task Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 04/23] metric collection: persist state after metric collection Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 05/23] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 06/23] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 07/23] metric collection: add tests for the fetch_remotes function Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 08/23] metric collection: add test for fetch_overdue Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 09/23] metric collection: pass rrd cache instance as function parameter Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 10/23] metric collection: add test for rrd task Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 11/23] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 12/23] metric collection: record remote response time in metric database Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 13/23] metric collection: save time needed for collection run to RRD Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 14/23] metric collection: periodically clean removed remotes from statefile Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 15/23] api: add endpoint to trigger metric collection Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 16/23] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 17/23] api: add api for querying metric collection RRD data Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 18/23] api: metric-collection: add status endpoint Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 19/23] pdm-client: add metric collection API methods Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 20/23] cli: add commands for metric-collection trigger and status Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 21/23] metric collection: factor out handle_tick and handle_control_message fns Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 22/23] metric collection: skip missed timer ticks Lukas Wagner
2025-08-21 9:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 23/23] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
2025-08-21 12:53 ` [pdm-devel] [PATCH proxmox-datacenter-manager v6 00/23] metric collection improvements (concurrency, API, CLI) Dominik Csapak
2025-08-21 13:46 ` Lukas Wagner
2025-08-22 11:27 ` Dominik Csapak
2025-08-27 7:19 ` Thomas Lamprecht
2025-08-22 11:51 ` Dominik Csapak
2025-08-22 12:49 ` Dominik Csapak
2025-08-25 8:43 ` Lukas Wagner
2025-08-26 13:53 ` [pdm-devel] superseded: " Lukas Wagner