* [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI)
@ 2025-08-26 13:50 Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 01/24] metric collection: split top_entities into separate module Lukas Wagner
` (24 more replies)
0 siblings, 25 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:50 UTC (permalink / raw)
To: pdm-devel
Key points:
- Fetch metrics concurrently
- Add some tests for the core logic in the metric collection system
- Allow triggering metric collection via the API
- Record metric collection statistics in the RRD:
  - overall collection time for all remotes
  - per-remote response time when fetching metrics
- Persist metric collection state to disk:
  /var/lib/proxmox-datacenter-manager/metric-collection-state.json
  (timestamps of last collection, errors)
- Trigger metric collection for any new remotes added via the API
- Add new API endpoints:
  POST /metric-collection/trigger with optional 'remote' param
  GET /metric-collection/status
  GET /remotes/<remote>/rrddata
  GET /nodes/localhost/rrddata
- Add CLI tooling:
  proxmox-datacenter-client metric-collection trigger [--remote <remote>]
  proxmox-datacenter-client metric-collection status
## Potential future work
- UI button for triggering metric collection
- Make collection interval configurable
- Show RRD graphs for metric collection stats somewhere
- Have some global concurrency control knob for background
requests [request scheduling].
Changes since [v6]:
- Changed API paths for RRD data:
  /nodes/localhost/rrddata includes the total time needed for metric collection
  /remotes/{remote}/rrddata includes the API response time for collecting a single remote
- Request the latest metrics for a single remote when requesting 'hourly' RRD stats
  for remote nodes, VMs, CTs
- Folded in some fixup patches at the end (not all, some led to too many conflicts)
Changes since [v5]:
- Rebased onto latest master
Changes since [v4]:
- Drop the metric collection config file for now -
  these settings might be better stored together with the config for other
  background tasks
Changes since [v3]:
- Rebase onto master
- Fix a couple of clippy warnings (CreateOptions is now Copy!)
Changes since [v2]:
- For now, drop settings that might change anyway with a
  global background request scheduling system [request scheduling]:
- max-concurrency
- {min,max}-interval-offset
- {min,max}-connection-delay
Changes since [v1]:
- Add missing dependency on librust-rand-dev to d/control
- Fix a couple of minor spelling/punctuation issues (thx maximiliano)
- Some minor code style improvements, e.g. using unwrap_or_else instead
  of doing a manual match
- Document return values of 'setup_timer' function
- Factor out handle_tick/handle_control_message
- Minor refactoring/code style improvements
- CLI: Change 'update-settings' to 'settings update'
- CLI: Change 'show-settings' to 'settings show'
- Change missed tick behavior for tokio::time::Interval to 'skip'
  instead of 'burst'.
The last three commits are new in v2.
[v1]: https://lore.proxmox.com/pdm-devel/20250211120541.163621-1-l.wagner@proxmox.com/T/#t
[v2]: https://lore.proxmox.com/pdm-devel/20250214130653.283012-1-l.wagner@proxmox.com/
[v3]: https://lore.proxmox.com/pdm-devel/20250416125642.291552-1-l.wagner@proxmox.com/T/#t
[v4]: https://lore.proxmox.com/pdm-devel/20250512133725.262263-1-l.wagner@proxmox.com/T/#t
[v5]: https://lore.proxmox.com/pdm-devel/c1a8deae-9590-471c-8505-d3e799bc7125@proxmox.com/T/#t
[request scheduling]: https://lore.proxmox.com/pdm-devel/7b3e90c8-6ebb-400f-acf9-cac084cc39fe@proxmox.com/
proxmox-datacenter-manager:
Lukas Wagner (24):
metric collection: split top_entities into separate module
metric collection: save metric data to RRD in separate task
metric collection: rework metric poll task
metric collection: persist state after metric collection
metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
metric collection: collect overdue metrics on startup/timer change
metric collection: add tests for the fetch_remotes function
metric collection: add test for fetch_overdue
metric collection: pass rrd cache instance as function parameter
metric collection: add test for rrd task
metric collection: wrap rrd_cache::Cache in a struct
metric collection: record remote response time in metric database
metric collection: save time needed for collection run to RRD
metric collection: periodically clean removed remotes from statefile
api: add endpoint to trigger metric collection
api: remotes: trigger immediate metric collection for newly added
nodes
api: add api for querying metric collection RRD data
api: metric-collection: add status endpoint
pdm-client: add metric collection API methods
cli: add commands for metric-collection trigger and status
metric collection: skip missed timer ticks
metric collection: use JoinSet instead of joining from handles in a
Vec
metric collection: allow to wait until completion when triggering
collection manually
api: pve: rrd: trigger and wait for metric collection when requesting
RRD data
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 70 ++
lib/pdm-api-types/src/lib.rs | 3 +
lib/pdm-api-types/src/metric_collection.rs | 20 +
lib/pdm-api-types/src/rrddata.rs | 26 +
lib/pdm-client/src/lib.rs | 56 ++
server/src/api/metric_collection.rs | 46 ++
server/src/api/mod.rs | 2 +
server/src/api/nodes/mod.rs | 2 +
server/src/api/nodes/rrddata.rs | 48 ++
server/src/api/pve/rrddata.rs | 43 +-
server/src/api/remotes.rs | 59 ++
server/src/api/resources.rs | 3 +-
server/src/api/rrd_common.rs | 11 +-
server/src/bin/proxmox-datacenter-api/main.rs | 2 +-
.../src/metric_collection/collection_task.rs | 661 ++++++++++++++++++
server/src/metric_collection/mod.rs | 346 +++------
server/src/metric_collection/rrd_cache.rs | 206 +++---
server/src/metric_collection/rrd_task.rs | 289 ++++++++
server/src/metric_collection/state.rs | 150 ++++
server/src/metric_collection/top_entities.rs | 150 ++++
22 files changed, 1817 insertions(+), 379 deletions(-)
create mode 100644 cli/client/src/metric_collection.rs
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
create mode 100644 server/src/api/metric_collection.rs
create mode 100644 server/src/api/nodes/rrddata.rs
create mode 100644 server/src/metric_collection/collection_task.rs
create mode 100644 server/src/metric_collection/rrd_task.rs
create mode 100644 server/src/metric_collection/state.rs
create mode 100644 server/src/metric_collection/top_entities.rs
Summary over all repositories:
22 files changed, 1817 insertions(+), 379 deletions(-)
--
Generated by murpp 0.9.0
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 01/24] metric collection: split top_entities into separate module
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
@ 2025-08-26 13:50 ` Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 02/24] metric collection: save metric data to RRD in separate task Lukas Wagner
` (23 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:50 UTC (permalink / raw)
To: pdm-devel
This makes the parent module a bit easier to navigate.
No functional changes intended.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/api/resources.rs | 3 +-
server/src/metric_collection/mod.rs | 149 +------------------
server/src/metric_collection/top_entities.rs | 148 ++++++++++++++++++
3 files changed, 152 insertions(+), 148 deletions(-)
create mode 100644 server/src/metric_collection/top_entities.rs
diff --git a/server/src/api/resources.rs b/server/src/api/resources.rs
index 6a8c8ef2..13a959a6 100644
--- a/server/src/api/resources.rs
+++ b/server/src/api/resources.rs
@@ -26,6 +26,7 @@ use proxmox_subscription::SubscriptionStatus;
use pve_api_types::{ClusterResource, ClusterResourceType};
use crate::connection;
+use crate::metric_collection::top_entities;
pub const ROUTER: Router = Router::new()
.get(&list_subdirs_api_method!(SUBDIRS))
@@ -320,7 +321,7 @@ async fn get_top_entities(timeframe: Option<RrdTimeframe>) -> Result<TopEntities
let (remotes_config, _) = pdm_config::remotes::config()?;
let timeframe = timeframe.unwrap_or(RrdTimeframe::Day);
- let res = crate::metric_collection::calculate_top(&remotes_config, timeframe, 10);
+ let res = top_entities::calculate_top(&remotes_config, timeframe, 10);
Ok(res)
}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index f62e2a63..588f7ad5 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -6,15 +6,13 @@ use anyhow::Error;
use pbs_api_types::{MetricDataPoint, MetricDataType};
use proxmox_rrd::rrd::DataSourceType;
-use pdm_api_types::{
- remotes::RemoteType,
- resource::{Resource, ResourceRrdData, TopEntities, TopEntity},
-};
+use pdm_api_types::remotes::RemoteType;
use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+pub mod top_entities;
const COLLECTION_INTERVAL: u64 = 60;
@@ -150,146 +148,3 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
data_source_type,
);
}
-
-fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
- let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
- Ok(idx) | Err(idx) => idx,
- };
-
- vec.insert(index, value);
- if vec.len() > limit {
- for _ in 0..(vec.len() - limit) {
- vec.remove(0);
- }
- }
-}
-
-// for now simple sum of the values => area under the graph curve
-fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
- let mut coefficient = 0.0;
- for point in values.data.iter() {
- let value = point.unwrap_or_default();
- if value.is_finite() {
- coefficient += value;
- }
- }
-
- coefficient
-}
-
-// FIXME: cache the values instead of calculate freshly every time?
-// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
-pub fn calculate_top(
- remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- num: usize,
-) -> TopEntities {
- let mut guest_cpu = Vec::new();
- let mut node_cpu = Vec::new();
- let mut node_memory = Vec::new();
-
- for remote_name in remotes.keys() {
- if let Some(data) =
- crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
- {
- for res in data.resources {
- let id = res.id().to_string();
- let name = format!("pve/{remote_name}/{id}");
- match &res {
- Resource::PveStorage(_) => {}
- Resource::PveQemu(_) | Resource::PveLxc(_) => {
- if let Some(entity) =
- get_entity(timeframe, remote_name, res, name, "cpu_current")
- {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
- }
- }
- Resource::PveNode(_) => {
- if let Some(entity) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "cpu_current",
- ) {
- let coefficient = (entity.0 * 100.0).round() as usize;
- insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
- }
- // convert mem/mem_total into a single entity
- if let Some(mut mem) = get_entity(
- timeframe,
- remote_name,
- res.clone(),
- name.clone(),
- "mem_used",
- ) {
- if let Some(mem_total) =
- get_entity(timeframe, remote_name, res, name, "mem_total")
- {
- // skip if we don't have the same amount of data for used and total
- let mem_rrd = &mem.1.rrd_data.data;
- let mem_total_rrd = &mem_total.1.rrd_data.data;
- if mem_rrd.len() != mem_total_rrd.len() {
- continue;
- }
- let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
- let mut mem_usage = Vec::new();
- for i in 0..mem_rrd.len() {
- let point = match (mem_rrd[i], mem_total_rrd[i]) {
- (Some(mem), Some(total)) => Some(mem / total),
- _ => None,
- };
- mem_usage.push(point)
- }
- mem.1.rrd_data.data = mem_usage;
- insert_sorted(&mut node_memory, (coefficient, mem.1), num);
- }
- }
- }
- Resource::PbsNode(_) => {}
- Resource::PbsDatastore(_) => {}
- }
- }
- }
- }
-
- TopEntities {
- guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
- node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
- }
-}
-
-fn get_entity(
- timeframe: proxmox_rrd_api_types::RrdTimeframe,
- remote_name: &String,
- res: Resource,
- name: String,
- metric: &str,
-) -> Option<(f64, TopEntity)> {
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &name,
- metric,
- timeframe,
- proxmox_rrd_api_types::RrdMode::Average,
- ) {
- let coefficient = calculate_coefficient(&values);
- if coefficient > 0.0 {
- return Some((
- coefficient,
- TopEntity {
- remote: remote_name.to_string(),
- resource: res,
- rrd_data: ResourceRrdData {
- start: values.start,
- resolution: values.resolution,
- data: values.data,
- },
- },
- ));
- }
- }
-
- None
-}
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
new file mode 100644
index 00000000..f8e053fb
--- /dev/null
+++ b/server/src/metric_collection/top_entities.rs
@@ -0,0 +1,148 @@
+use std::collections::HashMap;
+
+use pdm_api_types::resource::{Resource, ResourceRrdData, TopEntities, TopEntity};
+
+use super::rrd_cache;
+
+fn insert_sorted<T>(vec: &mut Vec<(usize, T)>, value: (usize, T), limit: usize) {
+ let index = match vec.binary_search_by_key(&value.0, |(idx, _)| *idx) {
+ Ok(idx) | Err(idx) => idx,
+ };
+
+ vec.insert(index, value);
+ if vec.len() > limit {
+ for _ in 0..(vec.len() - limit) {
+ vec.remove(0);
+ }
+ }
+}
+
+// for now simple sum of the values => area under the graph curve
+fn calculate_coefficient(values: &proxmox_rrd::Entry) -> f64 {
+ let mut coefficient = 0.0;
+ for point in values.data.iter() {
+ let value = point.unwrap_or_default();
+ if value.is_finite() {
+ coefficient += value;
+ }
+ }
+
+ coefficient
+}
+
+// FIXME: cache the values instead of calculate freshly every time?
+// FIXME: find better way to enumerate nodes/guests/etc.(instead of relying on the cache)
+pub fn calculate_top(
+ remotes: &HashMap<String, pdm_api_types::remotes::Remote>,
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ num: usize,
+) -> TopEntities {
+ let mut guest_cpu = Vec::new();
+ let mut node_cpu = Vec::new();
+ let mut node_memory = Vec::new();
+
+ for remote_name in remotes.keys() {
+ if let Some(data) =
+ crate::api::resources::get_cached_resources(remote_name, i64::MAX as u64)
+ {
+ for res in data.resources {
+ let id = res.id().to_string();
+ let name = format!("pve/{remote_name}/{id}");
+ match &res {
+ Resource::PveStorage(_) => {}
+ Resource::PveQemu(_) | Resource::PveLxc(_) => {
+ if let Some(entity) =
+ get_entity(timeframe, remote_name, res, name, "cpu_current")
+ {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut guest_cpu, (coefficient, entity.1), num);
+ }
+ }
+ Resource::PveNode(_) => {
+ if let Some(entity) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "cpu_current",
+ ) {
+ let coefficient = (entity.0 * 100.0).round() as usize;
+ insert_sorted(&mut node_cpu, (coefficient, entity.1), num);
+ }
+ // convert mem/mem_total into a single entity
+ if let Some(mut mem) = get_entity(
+ timeframe,
+ remote_name,
+ res.clone(),
+ name.clone(),
+ "mem_used",
+ ) {
+ if let Some(mem_total) =
+ get_entity(timeframe, remote_name, res, name, "mem_total")
+ {
+ // skip if we don't have the same amount of data for used and total
+ let mem_rrd = &mem.1.rrd_data.data;
+ let mem_total_rrd = &mem_total.1.rrd_data.data;
+ if mem_rrd.len() != mem_total_rrd.len() {
+ continue;
+ }
+ let coefficient = (100.0 * mem.0 / mem_total.0).round() as usize;
+ let mut mem_usage = Vec::new();
+ for i in 0..mem_rrd.len() {
+ let point = match (mem_rrd[i], mem_total_rrd[i]) {
+ (Some(mem), Some(total)) => Some(mem / total),
+ _ => None,
+ };
+ mem_usage.push(point)
+ }
+ mem.1.rrd_data.data = mem_usage;
+ insert_sorted(&mut node_memory, (coefficient, mem.1), num);
+ }
+ }
+ }
+ Resource::PbsNode(_) => {}
+ Resource::PbsDatastore(_) => {}
+ }
+ }
+ }
+ }
+
+ TopEntities {
+ guest_cpu: guest_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_cpu: node_cpu.into_iter().map(|(_, entity)| entity).collect(),
+ node_memory: node_memory.into_iter().map(|(_, entity)| entity).collect(),
+ }
+}
+
+fn get_entity(
+ timeframe: proxmox_rrd_api_types::RrdTimeframe,
+ remote_name: &String,
+ res: Resource,
+ name: String,
+ metric: &str,
+) -> Option<(f64, TopEntity)> {
+ if let Ok(Some(values)) = rrd_cache::extract_data(
+ &name,
+ metric,
+ timeframe,
+ proxmox_rrd_api_types::RrdMode::Average,
+ ) {
+ let coefficient = calculate_coefficient(&values);
+ if coefficient > 0.0 {
+ return Some((
+ coefficient,
+ TopEntity {
+ remote: remote_name.to_string(),
+ resource: res,
+ rrd_data: ResourceRrdData {
+ start: values.start,
+ resolution: values.resolution,
+ data: values.data,
+ },
+ },
+ ));
+ }
+ }
+
+ None
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 02/24] metric collection: save metric data to RRD in separate task
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 01/24] metric collection: split top_entities into separate module Lukas Wagner
@ 2025-08-26 13:50 ` Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 03/24] metric collection: rework metric poll task Lukas Wagner
` (22 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:50 UTC (permalink / raw)
To: pdm-devel
This is a preparation for concurrent metric collection. While the RRD
cache appears to be safe for concurrent access (it uses an RwLock to
protect the data), delegating all RRD writes to a separate tokio task
makes it (IMO) easier to reason about and understand. Furthermore, it
decouples the 'metric fetching' and 'metric storing' parts, making it
much easier to write tests for both parts independently.
For better code separation, this also splits out the new task into
a new submodule.
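The pattern boils down to something like the following stand-alone sketch
(illustrative only; the real RrdStoreRequest in rrd_task.rs below carries the
full ClusterMetrics/Metrics payloads and stores them via rrd_cache::update_value
instead of printing):

    use std::time::Duration;
    use tokio::sync::mpsc;

    // Sketch of the request type; the real enum carries the PVE/PBS metric payloads.
    enum RrdStoreRequest {
        Pve { remote: String },
        Pbs { remote: String },
    }

    // Dedicated task owning all RRD writes, fed through an mpsc channel.
    async fn store_in_rrd_task(mut rx: mpsc::Receiver<RrdStoreRequest>) {
        while let Some(request) = rx.recv().await {
            // RRD updates involve blocking file IO, so run them on the
            // blocking thread pool instead of stalling the async runtime.
            let res = tokio::task::spawn_blocking(move || match request {
                RrdStoreRequest::Pve { remote } => println!("store PVE metrics for {remote}"),
                RrdStoreRequest::Pbs { remote } => println!("store PBS metrics for {remote}"),
            })
            .await;

            if let Err(err) = res {
                eprintln!("error in rrd task when attempting to save metrics: {err}");
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(128);
        tokio::spawn(store_in_rrd_task(rx));
        // The metric collection side only ever sends requests over the channel.
        tx.send(RrdStoreRequest::Pve { remote: "some-remote".into() })
            .await
            .unwrap();
        // Give the store task a moment to process the request (sketch only).
        tokio::time::sleep(Duration::from_millis(100)).await;
    }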
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/metric_collection/mod.rs | 102 +++++++----------------
server/src/metric_collection/rrd_task.rs | 96 +++++++++++++++++++++
2 files changed, 128 insertions(+), 70 deletions(-)
create mode 100644 server/src/metric_collection/rrd_task.rs
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 588f7ad5..9c75f39c 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -2,18 +2,18 @@ use std::collections::HashMap;
use std::pin::pin;
use anyhow::Error;
-
-use pbs_api_types::{MetricDataPoint, MetricDataType};
-use proxmox_rrd::rrd::DataSourceType;
+use tokio::sync::mpsc::{self, Sender};
use pdm_api_types::remotes::RemoteType;
-use pve_api_types::{ClusterMetricsData, ClusterMetricsDataType};
use crate::{connection, task_utils};
pub mod rrd_cache;
+mod rrd_task;
pub mod top_entities;
+use rrd_task::RrdStoreRequest;
+
const COLLECTION_INTERVAL: u64 = 60;
/// Initialize the RRD cache
@@ -25,14 +25,22 @@ pub fn init() -> Result<(), Error> {
/// Start the metric collection task.
pub fn start_task() {
+ let (tx, rx) = mpsc::channel(128);
+
tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task());
+ let task_scheduler = pin!(metric_collection_task(tx));
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(task_scheduler, abort_future).await;
+ });
+
+ tokio::spawn(async move {
+ let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(task_scheduler, abort_future).await;
});
}
-async fn metric_collection_task() -> Result<(), Error> {
+async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
loop {
@@ -59,37 +67,33 @@ async fn metric_collection_task() -> Result<(), Error> {
.cluster_metrics_export(Some(true), Some(false), Some(start_time))
.await?;
- //// Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
RemoteType::Pbs => {
let client = connection::make_pbs_client(&remote)?;
let metrics = client.metrics(Some(true), Some(start_time)).await?;
- // Involves some blocking file IO
- tokio::task::spawn_blocking(move || {
- let mut most_recent_timestamp = 0;
+ let most_recent =
+ metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
- for data_point in metrics.data {
- most_recent_timestamp =
- most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote_name_clone, &data_point);
- }
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote_name_clone,
+ metrics,
+ })
+ .await?;
- most_recent_timestamp
- })
- .await
+ Ok::<i64, Error>(most_recent)
}
}?;
@@ -106,45 +110,3 @@ async fn metric_collection_task() -> Result<(), Error> {
}
}
}
-
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
- let name = format!(
- "pve/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
- ClusterMetricsDataType::Counter => DataSourceType::Counter,
- ClusterMetricsDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
-
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
- let name = format!(
- "pbs/{remote_name}/{id}/{metric}",
- id = data_point.id,
- metric = data_point.metric,
- );
-
- let data_source_type = match data_point.ty {
- MetricDataType::Gauge => DataSourceType::Gauge,
- MetricDataType::Counter => DataSourceType::Counter,
- MetricDataType::Derive => DataSourceType::Derive,
- };
-
- rrd_cache::update_value(
- &name,
- data_point.value,
- data_point.timestamp,
- data_source_type,
- );
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
new file mode 100644
index 00000000..a72945df
--- /dev/null
+++ b/server/src/metric_collection/rrd_task.rs
@@ -0,0 +1,96 @@
+use anyhow::Error;
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use proxmox_rrd::rrd::DataSourceType;
+use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
+use tokio::sync::mpsc::Receiver;
+
+use super::rrd_cache;
+
+/// Store request for the RRD task.
+pub(super) enum RrdStoreRequest {
+ /// Store PVE metrics.
+ Pve {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: ClusterMetrics,
+ },
+ /// Store PBS metrics.
+ Pbs {
+ /// Remote name.
+ remote: String,
+ /// Metric data.
+ metrics: Metrics,
+ },
+}
+
+/// Task which stores received metrics in the RRD. Metric data is fed into
+/// this task via a MPSC channel.
+pub(super) async fn store_in_rrd_task(
+ mut receiver: Receiver<RrdStoreRequest>,
+) -> Result<(), Error> {
+ while let Some(msg) = receiver.recv().await {
+ // Involves some blocking file IO
+ let res = tokio::task::spawn_blocking(move || match msg {
+ RrdStoreRequest::Pve { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pve(&remote, &data_point);
+ }
+ }
+ RrdStoreRequest::Pbs { remote, metrics } => {
+ for data_point in metrics.data {
+ store_metric_pbs(&remote, &data_point);
+ }
+ }
+ })
+ .await;
+
+ if let Err(err) = res {
+ log::error!("error in rrd task when attempting to save metrics: {err}");
+ }
+ }
+
+ Ok(())
+}
+
+fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+ let name = format!(
+ "pve/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ ClusterMetricsDataType::Gauge => DataSourceType::Gauge,
+ ClusterMetricsDataType::Counter => DataSourceType::Counter,
+ ClusterMetricsDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
+
+fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+ let name = format!(
+ "pbs/{remote_name}/{id}/{metric}",
+ id = data_point.id,
+ metric = data_point.metric,
+ );
+
+ let data_source_type = match data_point.ty {
+ MetricDataType::Gauge => DataSourceType::Gauge,
+ MetricDataType::Counter => DataSourceType::Counter,
+ MetricDataType::Derive => DataSourceType::Derive,
+ };
+
+ rrd_cache::update_value(
+ &name,
+ data_point.value,
+ data_point.timestamp,
+ data_source_type,
+ );
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 03/24] metric collection: rework metric poll task
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 01/24] metric collection: split top_entities into separate module Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 02/24] metric collection: save metric data to RRD in separate task Lukas Wagner
@ 2025-08-26 13:50 ` Lukas Wagner
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 04/24] metric collection: persist state after metric collection Lukas Wagner
` (21 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:50 UTC (permalink / raw)
To: pdm-devel
Metric collection is mostly limited by network latency, not disk IO or
CPU. This means we should be able to get significant speedups by polling
remotes concurrently from spawned tokio workers.
To avoid load/IO spikes, we limit the number of concurrent connections
using a semaphore.
Furthermore, we add a mechanism to trigger metric collection manually.
This is achieved by using `tokio::select!` on both the timer's `tick`
method and the `recv` method of an `mpsc` channel, which is used to send
control messages to the metric collection task.
For better code structure, the collection task is split into a separate
module.
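Condensed to its core, the mechanism looks roughly like the following
stand-alone sketch (illustrative only; the remote list and the per-remote work
are placeholders, the real implementation in collection_task.rs below fetches
metrics and forwards them to the RRD task):

    use std::{sync::Arc, time::Duration};
    use tokio::sync::{mpsc, Semaphore};

    const MAX_CONCURRENT_CONNECTIONS: usize = 20;

    enum ControlMsg {
        TriggerMetricCollection(Option<String>),
    }

    async fn collection_task(mut control_rx: mpsc::Receiver<ControlMsg>) {
        let mut timer = tokio::time::interval(Duration::from_secs(60));

        loop {
            // Wake up on either the periodic timer or a manual trigger.
            tokio::select! {
                _ = timer.tick() => {
                    // Placeholder remote list - normally loaded from remotes.cfg.
                    let remotes = vec!["remote-a".to_string(), "remote-b".to_string()];
                    fetch_remotes(&remotes).await;
                }
                Some(ControlMsg::TriggerMetricCollection(remote)) = control_rx.recv() => {
                    fetch_remotes(&remote.into_iter().collect::<Vec<_>>()).await;
                }
            }
        }
    }

    async fn fetch_remotes(remotes: &[String]) {
        // The semaphore caps the number of connections in flight at any time.
        let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
        let mut handles = Vec::new();

        for remote in remotes {
            let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
            let remote = remote.clone();
            handles.push(tokio::spawn(async move {
                let _permit = permit; // held for the duration of the fetch
                println!("fetching metrics from '{remote}'");
            }));
        }

        for handle in handles {
            let _ = handle.await;
        }
    }

    #[tokio::main]
    async fn main() {
        let (_control_tx, control_rx) = mpsc::channel(128);
        collection_task(control_rx).await;
    }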
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
Changes since v1:
- get_settings_or_default function: use unwrap_or_default instead
of match
- Add `rand` crate dependency to d/control
- Refactor sleep helpers, consolidate into a single
parameterised function
- Get rid of .unwrap when spawning up the metric collection task
- Let trigger_metric_collection_for_remote take a String instead of
&str
Changes since v2:
- Use hardcoded default for max-concurrency for now. This
might be replaced by some global knob, affecting all background tasks
- Drop interval delays/jitter for now, this also might be affected
by a global request scheduler
Changes since v6:
- Fold in later commit, splitting out the process_tick and
process_control_message functions
server/src/api/metric_collection.rs | 46 ++++
server/src/bin/proxmox-datacenter-api/main.rs | 2 +-
.../src/metric_collection/collection_task.rs | 227 ++++++++++++++++++
server/src/metric_collection/mod.rs | 135 ++++-------
4 files changed, 318 insertions(+), 92 deletions(-)
create mode 100644 server/src/api/metric_collection.rs
create mode 100644 server/src/metric_collection/collection_task.rs
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
new file mode 100644
index 00000000..0658fb1f
--- /dev/null
+++ b/server/src/api/metric_collection.rs
@@ -0,0 +1,46 @@
+use anyhow::Error;
+
+use proxmox_router::{Router, SubdirMap};
+use proxmox_schema::api;
+use proxmox_sortable_macro::sortable;
+
+use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, MetricCollectionStatus};
+
+use crate::metric_collection;
+
+pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
+
+#[sortable]
+const SUBDIRS: SubdirMap = &sorted!([
+ (
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+ ),
+ (
+ "status",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
+ ),
+]);
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ },
+ },
+)]
+/// Trigger metric collection for a provided remote or for all remotes if no remote is passed.
+pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ crate::metric_collection::trigger_metric_collection(remote, false).await?;
+
+ Ok(())
+}
+
+#[api]
+/// Read metric collection status.
+fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ metric_collection::get_status()
+}
diff --git a/server/src/bin/proxmox-datacenter-api/main.rs b/server/src/bin/proxmox-datacenter-api/main.rs
index 42bc0e1e..c0f0e3a4 100644
--- a/server/src/bin/proxmox-datacenter-api/main.rs
+++ b/server/src/bin/proxmox-datacenter-api/main.rs
@@ -373,7 +373,7 @@ async fn run(debug: bool) -> Result<(), Error> {
});
start_task_scheduler();
- metric_collection::start_task();
+ metric_collection::start_task()?;
tasks::remote_node_mapping::start_task();
resource_cache::start_task();
tasks::remote_tasks::start_task()?;
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
new file mode 100644
index 00000000..a0dce8fd
--- /dev/null
+++ b/server/src/metric_collection/collection_task.rs
@@ -0,0 +1,227 @@
+use std::{collections::HashMap, sync::Arc, time::Duration};
+
+use anyhow::Error;
+use tokio::{
+ sync::{
+ mpsc::{Receiver, Sender},
+ OwnedSemaphorePermit, Semaphore,
+ },
+ time::Interval,
+};
+
+use proxmox_section_config::typed::SectionConfigData;
+
+use pdm_api_types::remotes::{Remote, RemoteType};
+
+use crate::{connection, task_utils};
+
+use super::rrd_task::RrdStoreRequest;
+
+pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
+
+/// Default metric collection interval.
+pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+
+/// Control messages for the metric collection task.
+pub(super) enum ControlMsg {
+ TriggerMetricCollection(Option<String>),
+}
+
+/// Task which periodically collects metrics from all remotes and stores
+/// them in the local metrics database.
+pub(super) struct MetricCollectionTask {
+ most_recent_timestamps: HashMap<String, i64>,
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+}
+
+impl MetricCollectionTask {
+ /// Create a new metric collection task.
+ pub(super) fn new(
+ metric_data_tx: Sender<RrdStoreRequest>,
+ control_message_rx: Receiver<ControlMsg>,
+ ) -> Result<Self, Error> {
+ Ok(Self {
+ most_recent_timestamps: HashMap::new(),
+ metric_data_tx,
+ control_message_rx,
+ })
+ }
+
+ /// Run the metric collection task.
+ ///
+ /// This function never returns.
+ #[tracing::instrument(skip_all, name = "metric_collection_task")]
+ pub(super) async fn run(&mut self) {
+ let mut timer = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
+
+ log::debug!(
+ "metric collection starting up. Collection interval set to {} seconds.",
+ DEFAULT_COLLECTION_INTERVAL,
+ );
+
+ loop {
+ tokio::select! {
+ _ = timer.tick() => {
+ self.handle_tick().await;
+ }
+
+ Some(message) = self.control_message_rx.recv() => {
+ self.handle_control_message(message).await;
+ }
+ }
+ }
+ }
+
+ /// Handle a timer tick.
+ async fn handle_tick(&mut self) {
+ log::debug!("starting metric collection from all remotes - triggered by timer");
+
+ if let Some(remotes) = Self::load_remote_config() {
+ let to_fetch = remotes
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ }
+ }
+
+ /// Handle a control message for force-triggered collection.
+ async fn handle_control_message(&mut self, message: ControlMsg) {
+ if let Some(remotes) = Self::load_remote_config() {
+ match message {
+ ControlMsg::TriggerMetricCollection(Some(remote)) => {
+ log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
+ self.fetch_remotes(&remotes, &[remote]).await;
+ }
+ ControlMsg::TriggerMetricCollection(None) => {
+ log::debug!("starting metric collection from all remotes - triggered by control message");
+ let to_fetch = remotes
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ self.fetch_remotes(&remotes, &to_fetch).await;
+ }
+ }
+ }
+ }
+
+ /// Set up a [`tokio::time::Interval`] instance with the provided interval.
+ /// The timer will be aligned, e.g. an interval of `60` will let the timer
+ /// fire at minute boundaries.
+ fn setup_timer(interval: u64) -> Interval {
+ let mut timer = tokio::time::interval(Duration::from_secs(interval));
+ let first_run = task_utils::next_aligned_instant(interval).into();
+ timer.reset_at(first_run);
+
+ timer
+ }
+
+ /// Convenience helper to load `remote.cfg`, logging the error
+ /// and returning `None` if the config could not be read.
+ fn load_remote_config() -> Option<SectionConfigData<Remote>> {
+ match pdm_config::remotes::config() {
+ Ok((remotes, _)) => Some(remotes),
+ Err(e) => {
+ log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
+ None
+ }
+ }
+ }
+
+ /// Fetch metric data from a provided list of remotes concurrently.
+ /// The maximum number of concurrent connections is limited by
+ /// the hardcoded [`MAX_CONCURRENT_CONNECTIONS`] constant defined
+ /// in this module.
+ async fn fetch_remotes(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ remotes_to_fetch: &[String],
+ ) {
+ let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
+ let mut handles = Vec::new();
+
+ for remote_name in remotes_to_fetch {
+ let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+
+ // unwrap is okay here, acquire_* will only fail if `close` has been
+ // called on the semaphore.
+ let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
+
+ if let Some(remote) = remote_config.get(remote_name).cloned() {
+ log::debug!("fetching remote '{}'", remote.id);
+ let handle = tokio::spawn(Self::fetch_single_remote(
+ remote,
+ start_time,
+ self.metric_data_tx.clone(),
+ permit,
+ ));
+
+ handles.push((remote_name.clone(), handle));
+ }
+ }
+
+ for (remote_name, handle) in handles {
+ let res = handle.await;
+
+ match res {
+ Ok(Ok(ts)) => {
+ self.most_recent_timestamps
+ .insert(remote_name.to_string(), ts);
+ }
+ Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
+ Err(err) => {
+ log::error!(
+ "join error for metric collection task for remote {remote_name}: {err}"
+ )
+ }
+ }
+ }
+ }
+
+ /// Fetch a single remote.
+ #[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
+ async fn fetch_single_remote(
+ remote: Remote,
+ start_time: i64,
+ sender: Sender<RrdStoreRequest>,
+ _permit: OwnedSemaphorePermit,
+ ) -> Result<i64, Error> {
+ let most_recent_timestamp = match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(Some(true), Some(false), Some(start_time))
+ .await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client.metrics(Some(true), Some(start_time)).await?;
+
+ let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ })
+ .await?;
+
+ most_recent
+ }
+ };
+
+ Ok(most_recent_timestamp)
+ }
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 9c75f39c..3d2df331 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,20 +1,17 @@
-use std::collections::HashMap;
use std::pin::pin;
+use std::sync::OnceLock;
-use anyhow::Error;
+use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
-use pdm_api_types::remotes::RemoteType;
-
-use crate::{connection, task_utils};
-
+mod collection_task;
pub mod rrd_cache;
mod rrd_task;
pub mod top_entities;
-use rrd_task::RrdStoreRequest;
+use collection_task::{ControlMsg, MetricCollectionTask};
-const COLLECTION_INTERVAL: u64 = 60;
+static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
@@ -24,89 +21,45 @@ pub fn init() -> Result<(), Error> {
}
/// Start the metric collection task.
-pub fn start_task() {
- let (tx, rx) = mpsc::channel(128);
+pub fn start_task() -> Result<(), Error> {
+ let (metric_data_tx, metric_data_rx) = mpsc::channel(128);
- tokio::spawn(async move {
- let task_scheduler = pin!(metric_collection_task(tx));
- let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
- });
-
- tokio::spawn(async move {
- let task_scheduler = pin!(rrd_task::store_in_rrd_task(rx));
- let abort_future = pin!(proxmox_daemon::shutdown_future());
- futures::future::select(task_scheduler, abort_future).await;
- });
-}
-
-async fn metric_collection_task(sender: Sender<RrdStoreRequest>) -> Result<(), Error> {
- let mut most_recent_timestamps: HashMap<String, i64> = HashMap::new();
-
- loop {
- let delay_target = task_utils::next_aligned_instant(COLLECTION_INTERVAL);
- tokio::time::sleep_until(tokio::time::Instant::from_std(delay_target)).await;
-
- let remotes = match pdm_config::remotes::config() {
- Ok((remotes, _)) => remotes,
- Err(e) => {
- log::error!("failed to collect metrics, could not read remotes.cfg: {e}");
- continue;
- }
- };
-
- for (remote_name, remote) in remotes {
- let start_time = *most_recent_timestamps.get(&remote_name).unwrap_or(&0);
- let remote_name_clone = remote_name.clone();
-
- let res = async {
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(&remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(&remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
-
- let most_recent =
- metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
-
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote_name_clone,
- metrics,
- })
- .await?;
-
- Ok::<i64, Error>(most_recent)
- }
- }?;
-
- Ok::<i64, Error>(most_recent_timestamp)
- }
- .await;
-
- match res {
- Ok(ts) => {
- most_recent_timestamps.insert(remote_name.to_string(), ts);
- }
- Err(err) => log::error!("failed to collect metrics for {remote_name}: {err}"),
- }
- }
+ let (trigger_collection_tx, trigger_collection_rx) = mpsc::channel(128);
+ if CONTROL_MESSAGE_TX.set(trigger_collection_tx).is_err() {
+ bail!("control message sender alread set");
}
+
+ tokio::spawn(async move {
+ let metric_collection_task_future = pin!(async move {
+ match MetricCollectionTask::new(metric_data_tx, trigger_collection_rx) {
+ Ok(mut task) => task.run().await,
+ Err(err) => log::error!("could not start metric collection task: {err}"),
+ }
+ });
+
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(metric_collection_task_future, abort_future).await;
+ });
+
+ tokio::spawn(async move {
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+ let abort_future = pin!(proxmox_daemon::shutdown_future());
+ futures::future::select(rrd_task_future, abort_future).await;
+ });
+
+ Ok(())
+}
+
+/// Schedule metric collection for a given remote as soon as possible.
+///
+/// Has no effect if the tx end of the channel has not been initialized yet.
+/// Returns an error if the mpsc channel has been closed already.
+pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ if let Some(sender) = CONTROL_MESSAGE_TX.get() {
+ sender
+ .send(ControlMsg::TriggerMetricCollection(remote))
+ .await?;
+ }
+
+ Ok(())
}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 04/24] metric collection: persist state after metric collection
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (2 preceding siblings ...)
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 03/24] metric collection: rework metric poll task Lukas Wagner
@ 2025-08-26 13:50 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 05/24] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
` (20 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:50 UTC (permalink / raw)
To: pdm-devel
The metric collection task has to maintain some state; at the moment this
is only the most recent timestamp from all received metric data points.
We use this as a cut-off time when requesting metrics from the remote, as
everything *older* than that should already be stored in the database.
Up until now, the state was only kept in memory, which means that it
was lost when the daemon restarted.
The changes in this commit make the metric collection system
load/save its state from/to a file in
/var/lib/proxmox-datacenter-manager/metric-collection-state.json
The following fields are stored for every remote:
- most-recent-datapoint:
  Timestamp of the most recent datapoint, used as a cut-off when
  fetching new metrics
- last-collection:
  Timestamp of the last *successful* metric collection for the remote,
  based on PDM's local time
- error:
  String representation of any error that occurred during the last
  collection attempt
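With the kebab-case (de)serialization used in state.rs below, the persisted
state file might look roughly like this (remote name and values are purely
illustrative):

    {
      "remote-status": {
        "some-remote": {
          "most-recent-datapoint": 1735689600,
          "last-collection": 1735689605,
          "error": "connection timed out"
        }
      }
    }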
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
Changes since v1:
- use .inspect_err(..).unwrap_or_default() instead of a manual match
in `MetricCollectionState::new`
.../src/metric_collection/collection_task.rs | 131 ++++++++++++------
server/src/metric_collection/mod.rs | 1 +
server/src/metric_collection/rrd_task.rs | 60 ++++++--
server/src/metric_collection/state.rs | 126 +++++++++++++++++
4 files changed, 268 insertions(+), 50 deletions(-)
create mode 100644 server/src/metric_collection/state.rs
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index a0dce8fd..c797a624 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,21 +1,31 @@
-use std::{collections::HashMap, sync::Arc, time::Duration};
+use std::{sync::Arc, time::Duration};
use anyhow::Error;
use tokio::{
sync::{
mpsc::{Receiver, Sender},
- OwnedSemaphorePermit, Semaphore,
+ oneshot, OwnedSemaphorePermit, Semaphore,
},
time::Interval,
};
use proxmox_section_config::typed::SectionConfigData;
+use proxmox_sys::fs::CreateOptions;
use pdm_api_types::remotes::{Remote, RemoteType};
use crate::{connection, task_utils};
-use super::rrd_task::RrdStoreRequest;
+use super::{
+ rrd_task::{RrdStoreRequest, RrdStoreResult},
+ state::{MetricCollectionState, RemoteStatus},
+};
+
+/// Location of the metric collection state file.
+const METRIC_COLLECTION_STATE_FILE: &str = concat!(
+ pdm_buildcfg::PDM_STATE_DIR_M!(),
+ "/metric-collection-state.json"
+);
pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
@@ -30,7 +40,7 @@ pub(super) enum ControlMsg {
/// Task which periodically collects metrics from all remotes and stores
/// them in the local metrics database.
pub(super) struct MetricCollectionTask {
- most_recent_timestamps: HashMap<String, i64>,
+ state: MetricCollectionState,
metric_data_tx: Sender<RrdStoreRequest>,
control_message_rx: Receiver<ControlMsg>,
}
@@ -41,8 +51,10 @@ impl MetricCollectionTask {
metric_data_tx: Sender<RrdStoreRequest>,
control_message_rx: Receiver<ControlMsg>,
) -> Result<Self, Error> {
+ let state = load_state()?;
+
Ok(Self {
- most_recent_timestamps: HashMap::new(),
+ state,
metric_data_tx,
control_message_rx,
})
@@ -70,6 +82,10 @@ impl MetricCollectionTask {
self.handle_control_message(message).await;
}
}
+
+ if let Err(err) = self.state.save() {
+ log::error!("could not update metric collection state: {err}");
+ }
}
}
@@ -142,7 +158,11 @@ impl MetricCollectionTask {
let mut handles = Vec::new();
for remote_name in remotes_to_fetch {
- let start_time = *self.most_recent_timestamps.get(remote_name).unwrap_or(&0);
+ let status = self
+ .state
+ .get_status(remote_name)
+ .cloned()
+ .unwrap_or_default();
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
@@ -152,7 +172,7 @@ impl MetricCollectionTask {
log::debug!("fetching remote '{}'", remote.id);
let handle = tokio::spawn(Self::fetch_single_remote(
remote,
- start_time,
+ status,
self.metric_data_tx.clone(),
permit,
));
@@ -166,8 +186,7 @@ impl MetricCollectionTask {
match res {
Ok(Ok(ts)) => {
- self.most_recent_timestamps
- .insert(remote_name.to_string(), ts);
+ self.state.set_status(remote_name, ts);
}
Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
@@ -183,45 +202,79 @@ impl MetricCollectionTask {
#[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
async fn fetch_single_remote(
remote: Remote,
- start_time: i64,
+ mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<i64, Error> {
- let most_recent_timestamp = match remote.ty {
- RemoteType::Pve => {
- let client = connection::make_pve_client(&remote)?;
- let metrics = client
- .cluster_metrics_export(Some(true), Some(false), Some(start_time))
- .await?;
+ ) -> Result<RemoteStatus, Error> {
+ let (result_tx, result_rx) = oneshot::channel();
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+ let now = proxmox_time::epoch_i64();
- sender
- .send(RrdStoreRequest::Pve {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
+ let res: Result<RrdStoreResult, Error> = async {
+ match remote.ty {
+ RemoteType::Pve => {
+ let client = connection::make_pve_client(&remote)?;
+ let metrics = client
+ .cluster_metrics_export(
+ Some(true),
+ Some(false),
+ Some(status.most_recent_datapoint),
+ )
+ .await?;
- most_recent
+ sender
+ .send(RrdStoreRequest::Pve {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
+ RemoteType::Pbs => {
+ let client = connection::make_pbs_client(&remote)?;
+ let metrics = client
+ .metrics(Some(true), Some(status.most_recent_datapoint))
+ .await?;
+
+ sender
+ .send(RrdStoreRequest::Pbs {
+ remote: remote.id.clone(),
+ metrics,
+ channel: result_tx,
+ })
+ .await?;
+ }
}
- RemoteType::Pbs => {
- let client = connection::make_pbs_client(&remote)?;
- let metrics = client.metrics(Some(true), Some(start_time)).await?;
- let most_recent = metrics.data.iter().fold(0, |acc, x| acc.max(x.timestamp));
+ result_rx.await.map_err(Error::from)
+ }
+ .await;
- sender
- .send(RrdStoreRequest::Pbs {
- remote: remote.id.clone(),
- metrics,
- })
- .await?;
-
- most_recent
+ match res {
+ Ok(result) => {
+ status.most_recent_datapoint = result.most_recent_timestamp;
+ status.last_collection = Some(now);
+ status.error = None;
}
- };
+ Err(err) => {
+ status.error = Some(err.to_string());
+ log::error!("coud not fetch metrics from '{}': {err}", remote.id);
+ }
+ }
- Ok(most_recent_timestamp)
+ Ok(status)
}
}
+
+/// Load the metric collection state file.
+pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
+
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+ Ok(MetricCollectionState::new(
+ METRIC_COLLECTION_STATE_FILE.into(),
+ file_options,
+ ))
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 3d2df331..7f67beab 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -7,6 +7,7 @@ use tokio::sync::mpsc::{self, Sender};
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
+mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index a72945df..1c618f54 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,8 +1,10 @@
use anyhow::Error;
-use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
+use tokio::sync::{mpsc::Receiver, oneshot};
+
use proxmox_rrd::rrd::DataSourceType;
+
+use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use tokio::sync::mpsc::Receiver;
use super::rrd_cache;
@@ -14,6 +16,8 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: ClusterMetrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
/// Store PBS metrics.
Pbs {
@@ -21,9 +25,17 @@ pub(super) enum RrdStoreRequest {
remote: String,
/// Metric data.
metrics: Metrics,
+ /// Oneshot channel to return the [`RrdStoreResult`].
+ channel: oneshot::Sender<RrdStoreResult>,
},
}
+/// Result for a [`RrdStoreRequest`].
+pub(super) struct RrdStoreResult {
+ /// Most recent timestamp of any stored metric datapoint (UNIX epoch).
+ pub(super) most_recent_timestamp: i64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -31,17 +43,43 @@ pub(super) async fn store_in_rrd_task(
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
// Involves some blocking file IO
- let res = tokio::task::spawn_blocking(move || match msg {
- RrdStoreRequest::Pve { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pve(&remote, &data_point);
+ let res = tokio::task::spawn_blocking(move || {
+ let mut most_recent_timestamp = 0;
+ let channel = match msg {
+ RrdStoreRequest::Pve {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pve(&remote, &data_point);
+ }
+
+ channel
}
- }
- RrdStoreRequest::Pbs { remote, metrics } => {
- for data_point in metrics.data {
- store_metric_pbs(&remote, &data_point);
+ RrdStoreRequest::Pbs {
+ remote,
+ metrics,
+ channel,
+ } => {
+ for data_point in metrics.data {
+ most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
+ store_metric_pbs(&remote, &data_point);
+ }
+
+ channel
}
- }
+ };
+
+ if channel
+ .send(RrdStoreResult {
+ most_recent_timestamp,
+ })
+ .is_err()
+ {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
})
.await;
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
new file mode 100644
index 00000000..a0bf10fe
--- /dev/null
+++ b/server/src/metric_collection/state.rs
@@ -0,0 +1,126 @@
+use std::{
+ collections::HashMap,
+ path::{Path, PathBuf},
+};
+
+use anyhow::Error;
+use serde::{Deserialize, Serialize};
+
+use proxmox_sys::fs::CreateOptions;
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Metric collection state file content.
+struct State {
+ remote_status: HashMap<String, RemoteStatus>,
+}
+
+#[derive(Serialize, Deserialize, Debug, Default, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// A remote's metric collection state.
+pub struct RemoteStatus {
+ /// Most recent datapoint - time stamp is based on remote time
+ pub most_recent_datapoint: i64,
+ /// Last successful metric collection - timestamp based on PDM's time
+ pub last_collection: Option<i64>,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ /// Any error that occurred during the last metric collection attempt.
+ pub error: Option<String>,
+}
+
+/// Manage and persist metric collection state.
+pub struct MetricCollectionState {
+ /// Path to the persisted state
+ path: PathBuf,
+ /// File owner/perms for the persisted state file
+ file_options: CreateOptions,
+ /// The current state
+ state: State,
+}
+
+impl MetricCollectionState {
+ /// Initialize state by trying to load the existing statefile. If the file does not exist,
+ /// state will be empty. If the file failed to load, state will be empty and
+ /// an error will be logged.
+ pub fn new(statefile: PathBuf, file_options: CreateOptions) -> Self {
+ let state = Self::load_or_default(&statefile)
+ .inspect_err(|err| {
+ log::error!("could not load metric collection state: {err}");
+ })
+ .unwrap_or_default();
+
+ Self {
+ path: statefile,
+ file_options,
+ state,
+ }
+ }
+
+ /// Set a remote's status.
+ pub fn set_status(&mut self, remote: String, remote_state: RemoteStatus) {
+ self.state.remote_status.insert(remote, remote_state);
+ }
+
+ /// Get a remote's status.
+ pub fn get_status(&self, remote: &str) -> Option<&RemoteStatus> {
+ self.state.remote_status.get(remote)
+ }
+
+ /// Persist the state to the statefile.
+ pub fn save(&self) -> Result<(), Error> {
+ let data = serde_json::to_vec_pretty(&self.state)?;
+ proxmox_sys::fs::replace_file(&self.path, &data, self.file_options, true)?;
+
+ Ok(())
+ }
+
+ fn load_or_default(path: &Path) -> Result<State, Error> {
+ let content = proxmox_sys::fs::file_read_optional_string(path)?;
+
+ if let Some(content) = content {
+ Ok(serde_json::from_str(&content)?)
+ } else {
+ Ok(Default::default())
+ }
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_sys::fs::CreateOptions;
+
+ use super::*;
+
+ use crate::test_support::temp::NamedTempFile;
+
+ fn get_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ #[test]
+ fn save_and_load() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_options())?;
+ let options = get_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+
+ state.set_status(
+ "some-remote".into(),
+ RemoteStatus {
+ most_recent_datapoint: 1234,
+ ..Default::default()
+ },
+ );
+
+ state.save()?;
+
+ let state = MetricCollectionState::new(file.path().into(), options);
+
+ let status = state.get_status("some-remote").unwrap();
+ assert_eq!(status.most_recent_datapoint, 1234);
+
+ Ok(())
+ }
+}
--
2.47.2
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 05/24] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (3 preceding siblings ...)
2025-08-26 13:50 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 04/24] metric collection: persist state after metric collection Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 06/24] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
` (19 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
It makes sense to limit the collection frequency per remote, especially
since we will have the ability to trigger collection manually.
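The guard itself is small; roughly the following (simplified excerpt of the
hunk below, MIN_COLLECTION_INTERVAL being 10 seconds):

    // skip this remote if its data was fetched less than
    // MIN_COLLECTION_INTERVAL seconds ago
    if now - status.last_collection.unwrap_or(0) < MIN_COLLECTION_INTERVAL as i64 {
        continue;
    }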
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/metric_collection/collection_task.rs | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index c797a624..f6052b73 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -31,6 +31,8 @@ pub const MAX_CONCURRENT_CONNECTIONS: usize = 20;
/// Default metric collection interval.
pub const DEFAULT_COLLECTION_INTERVAL: u64 = 600;
+/// Minimum metric collection interval.
+pub const MIN_COLLECTION_INTERVAL: u64 = 10;
/// Control messages for the metric collection task.
pub(super) enum ControlMsg {
@@ -156,6 +158,7 @@ impl MetricCollectionTask {
) {
let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
let mut handles = Vec::new();
+ let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
let status = self
@@ -164,6 +167,13 @@ impl MetricCollectionTask {
.cloned()
.unwrap_or_default();
+ if now - status.last_collection.unwrap_or(0) < MIN_COLLECTION_INTERVAL as i64 {
+ log::debug!(
+ "skipping metric collection for remote '{remote_name}' - data is recent enough"
+ );
+ continue;
+ }
+
// unwrap is okay here, acquire_* will only fail if `close` has been
// called on the semaphore.
let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 06/24] metric collection: collect overdue metrics on startup/timer change
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (4 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 05/24] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 07/24] metric collection: add tests for the fetch_remotes function Lukas Wagner
` (18 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
Because the timer fires at aligned points in time and might not fire
right away after being set up, gaps in the data could occur when the
timer interval changes or at daemon startup.
To mitigate this, on daemon startup and also whenever the collection
interval changes, we do the following (see the sketch after the list):
- check if the time until the next scheduled regular collection
plus the time *since* the last successful collection exceeds
the collection interval
- if yes, we collect immediately
- if no, we do nothing and let the remote be collected at the
next timer tick
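In rough pseudo-Rust, the per-remote check boils down to this (simplified
sketch, names may differ slightly from the actual patch):

    let until_next_tick = next_run - Instant::now();
    let since_last = now - last_collection; // both based on epoch seconds

    // if waiting for the next regular tick would make this remote exceed
    // its collection interval, fetch it right away
    if since_last + until_next_tick.as_secs() as i64 > collection_interval as i64 {
        overdue.push(remote_name.to_string());
    }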
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
Changes since v1:
- Document return values of `setup_timer`
.../src/metric_collection/collection_task.rs | 57 +++++++++++++++++--
1 file changed, 51 insertions(+), 6 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index f6052b73..5d19e1c5 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -1,4 +1,7 @@
-use std::{sync::Arc, time::Duration};
+use std::{
+ sync::Arc,
+ time::{Duration, Instant},
+};
use anyhow::Error;
use tokio::{
@@ -67,12 +70,17 @@ impl MetricCollectionTask {
/// This function never returns.
#[tracing::instrument(skip_all, name = "metric_collection_task")]
pub(super) async fn run(&mut self) {
- let mut timer = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
+ let (mut timer, first_tick) = Self::setup_timer(DEFAULT_COLLECTION_INTERVAL);
log::debug!(
"metric collection starting up. Collection interval set to {} seconds.",
DEFAULT_COLLECTION_INTERVAL,
);
+ // Check and fetch any remote which would be overdue by the time the
+ // timer first fires.
+ if let Some(remote_config) = Self::load_remote_config() {
+ self.fetch_overdue(&remote_config, first_tick).await;
+ }
loop {
tokio::select! {
@@ -127,12 +135,16 @@ impl MetricCollectionTask {
/// Set up a [`tokio::time::Interval`] instance with the provided interval.
/// The timer will be aligned, e.g. an interval of `60` will let the timer
/// fire at minute boundaries.
- fn setup_timer(interval: u64) -> Interval {
+ ///
+ /// The return values are a tuple of the [`tokio::time::Interval`] timer instance
+ /// and the [`std::time::Instant`] at which the timer first fires.
+ fn setup_timer(interval: u64) -> (Interval, Instant) {
+ log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
- let first_run = task_utils::next_aligned_instant(interval).into();
- timer.reset_at(first_run);
+ let first_run = task_utils::next_aligned_instant(interval);
+ timer.reset_at(first_run.into());
- timer
+ (timer, first_run)
}
/// Convenience helper to load `remote.cfg`, logging the error
@@ -208,6 +220,39 @@ impl MetricCollectionTask {
}
}
+ /// Fetch metric data from remotes which are overdue for collection.
+ ///
+ /// Use this on startup of the metric collection loop as well as
+ /// when the collection interval changes.
+ async fn fetch_overdue(
+ &mut self,
+ remote_config: &SectionConfigData<Remote>,
+ next_run: Instant,
+ ) {
+ let left_until_scheduled = next_run - Instant::now();
+ let now = proxmox_time::epoch_i64();
+
+ let mut overdue = Vec::new();
+
+ for (remote, _) in remote_config.iter() {
+ let last_collection = self
+ .state
+ .get_status(remote)
+ .and_then(|s| s.last_collection)
+ .unwrap_or(0);
+
+ let diff = now - last_collection;
+
+ if diff + left_until_scheduled.as_secs() as i64 > DEFAULT_COLLECTION_INTERVAL as i64 {
+ log::debug!(
+ "starting metric collection for remote '{remote}' - triggered because collection is overdue"
+ );
+ overdue.push(remote.into());
+ }
+ }
+ self.fetch_remotes(remote_config, &overdue).await;
+ }
+
/// Fetch a single remote.
#[tracing::instrument(skip_all, fields(remote = remote.id), name = "metric_collection_task")]
async fn fetch_single_remote(
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 07/24] metric collection: add tests for the fetch_remotes function
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (5 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 06/24] metric collection: collect overdue metrics on startup/timer change Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 08/24] metric collection: add test for fetch_overdue Lukas Wagner
` (17 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
Now that remote clients and the remote config can be mocked, and since
the metric collection and RRD tasks are decoupled, it is viable to write
a unit test for one of the core pieces of the metric collection system.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 226 ++++++++++++++++++
server/src/metric_collection/state.rs | 18 +-
2 files changed, 231 insertions(+), 13 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 5d19e1c5..9e530b40 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -333,3 +333,229 @@ pub(super) fn load_state() -> Result<MetricCollectionState, Error> {
file_options,
))
}
+
+#[cfg(test)]
+pub(super) mod tests {
+ use std::sync::Once;
+
+ use anyhow::bail;
+ use http::StatusCode;
+
+ use pdm_api_types::Authid;
+ use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ connection::{ClientFactory, PveClient},
+ metric_collection::rrd_task::RrdStoreResult,
+ pbs_client::PbsClient,
+ test_support::temp::NamedTempFile,
+ };
+
+ use super::*;
+
+ pub(crate) fn get_create_options() -> CreateOptions {
+ CreateOptions::new()
+ .owner(nix::unistd::Uid::effective())
+ .group(nix::unistd::Gid::effective())
+ .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
+ }
+
+ struct TestClientFactory {
+ now: i64,
+ }
+
+ #[async_trait::async_trait]
+ impl ClientFactory for TestClientFactory {
+ fn make_pve_client(&self, remote: &Remote) -> Result<Arc<PveClient>, Error> {
+ Ok(Arc::new(TestPveClient {
+ fail: remote.id.contains("fail"),
+ now: self.now,
+ }))
+ }
+ /// Create a new API client for PVE remotes, but with a specific endpoint.
+ fn make_pve_client_with_endpoint(
+ &self,
+ _remote: &Remote,
+ _target_endpoint: Option<&str>,
+ ) -> Result<Arc<PveClient>, Error> {
+ bail!("not implemented")
+ }
+
+ fn make_pbs_client(&self, _remote: &Remote) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pve_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Arc<PveClient>, Error> {
+ bail!("not implemented")
+ }
+
+ async fn make_pbs_client_and_login(
+ &self,
+ _remote: &Remote,
+ ) -> Result<Box<PbsClient>, Error> {
+ bail!("not implemented")
+ }
+ }
+
+ struct TestPveClient {
+ now: i64,
+ fail: bool,
+ }
+
+ #[async_trait::async_trait]
+ impl pve_api_types::client::PveClient for TestPveClient {
+ /// Retrieve metrics of the cluster.
+ async fn cluster_metrics_export(
+ &self,
+ _history: Option<bool>,
+ _local_only: Option<bool>,
+ start_time: Option<i64>,
+ ) -> Result<ClusterMetrics, proxmox_client::Error> {
+ if self.fail {
+ return Err(proxmox_client::Error::Api(
+ StatusCode::INTERNAL_SERVER_ERROR,
+ "internal server error".into(),
+ ));
+ }
+
+ let mut time = start_time.unwrap_or(0);
+ time = time.max(self.now - (30 * 60));
+ let mut data = Vec::new();
+
+ use pve_api_types::ClusterMetricsDataType::*;
+
+ while time < self.now {
+ let point = |id: &str, metric: &str, timestamp, ty| ClusterMetricsData {
+ id: id.into(),
+ metric: metric.into(),
+ timestamp,
+ ty,
+ value: 0.1,
+ };
+
+ for i in 0..5 {
+ let id = format!("node/node-{i}");
+ data.push(point(&id, "cpu_current", time, Gauge));
+ }
+
+ // Advance time by 10 seconds
+ time += 10;
+ }
+
+ Ok(ClusterMetrics { data })
+ }
+ }
+
+ fn make_remote_config() -> SectionConfigData<Remote> {
+ let mut sections = SectionConfigData::default();
+
+ for i in 0..4 {
+ let status = if i >= 2 { "fail" } else { "pass" };
+ let name = format!("pve-{i}-{status}");
+
+ sections.insert(
+ name.clone(),
+ Remote {
+ ty: pdm_api_types::remotes::RemoteType::Pve,
+ id: name.clone(),
+ nodes: Vec::new(),
+ authid: Authid::root_auth_id().clone(),
+ token: "".into(),
+ web_url: None,
+ },
+ );
+ }
+
+ sections
+ }
+
+ async fn fake_rrd_task(mut rx: Receiver<RrdStoreRequest>) -> u32 {
+ let mut number_of_requests = 0;
+
+ while let Some(request) = rx.recv().await {
+ number_of_requests += 1;
+
+ if let RrdStoreRequest::Pve {
+ metrics, channel, ..
+ } = request
+ {
+ let most_recent_timestamp =
+ metrics.data.iter().map(|e| e.timestamp).max().unwrap_or(0);
+
+ let _ = channel.send(RrdStoreResult {
+ most_recent_timestamp,
+ });
+ }
+ }
+
+ number_of_requests
+ }
+
+ static START: Once = Once::new();
+
+ fn test_init() -> i64 {
+ let now = 10000;
+ START.call_once(|| {
+ // TODO: the client factory is currently stored in a OnceLock -
+ // we can only set it from one test... Ideally we'd like to have the
+ // option to set it in every single test if needed - task/thread local?
+ connection::init(Box::new(TestClientFactory { now }));
+ });
+
+ now
+ }
+
+ #[tokio::test]
+ async fn test_fetch_remotes_updates_state() {
+ // Arrange
+ let now = test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ // Act
+ let to_fetch = config
+ .iter()
+ .map(|(name, _)| name.into())
+ .collect::<Vec<String>>();
+ task.fetch_remotes(&config, &to_fetch).await;
+
+ // Assert
+ for remote in &to_fetch {
+ let status = task.state.get_status(remote).unwrap();
+
+ // Our faked PVE client will return an error if the remote name contains
+ // 'fail'.
+ if remote.contains("fail") {
+ assert!(status
+ .error
+ .as_ref()
+ .unwrap()
+ .contains("internal server error"));
+ assert_eq!(status.last_collection, None);
+ } else {
+ assert!(now - status.most_recent_datapoint <= 10);
+ assert!(status.error.is_none());
+ }
+ }
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 2);
+ }
+}
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index a0bf10fe..e76444a9 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -87,24 +87,16 @@ impl MetricCollectionState {
#[cfg(test)]
mod tests {
- use proxmox_sys::fs::CreateOptions;
+ use crate::metric_collection::collection_task::tests::get_create_options;
+ use crate::test_support::temp::NamedTempFile;
use super::*;
- use crate::test_support::temp::NamedTempFile;
-
- fn get_options() -> CreateOptions {
- CreateOptions::new()
- .owner(nix::unistd::Uid::effective())
- .group(nix::unistd::Gid::effective())
- .perm(nix::sys::stat::Mode::from_bits_truncate(0o600))
- }
-
#[test]
fn save_and_load() -> Result<(), Error> {
- let file = NamedTempFile::new(get_options())?;
- let options = get_options();
- let mut state = MetricCollectionState::new(file.path().into(), options.clone());
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options);
state.set_status(
"some-remote".into(),
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 08/24] metric collection: add test for fetch_overdue
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (6 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 07/24] metric collection: add tests for the fetch_remotes function Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 09/24] metric collection: pass rrd cache instance as function parameter Lukas Wagner
` (16 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This test ensures that the logic for fetching metrics from remotes that
are overdue for collection continues to work as expected.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 62 ++++++++++++++++++-
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 9e530b40..c3448995 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -79,7 +79,8 @@ impl MetricCollectionTask {
// Check and fetch any remote which would be overdue by the time the
// timer first fires.
if let Some(remote_config) = Self::load_remote_config() {
- self.fetch_overdue(&remote_config, first_tick).await;
+ self.fetch_overdue(&remote_config, first_tick, DEFAULT_COLLECTION_INTERVAL)
+ .await;
}
loop {
@@ -228,6 +229,7 @@ impl MetricCollectionTask {
&mut self,
remote_config: &SectionConfigData<Remote>,
next_run: Instant,
+ collection_interval: u64,
) {
let left_until_scheduled = next_run - Instant::now();
let now = proxmox_time::epoch_i64();
@@ -243,7 +245,7 @@ impl MetricCollectionTask {
let diff = now - last_collection;
- if diff + left_until_scheduled.as_secs() as i64 > DEFAULT_COLLECTION_INTERVAL as i64 {
+ if diff + left_until_scheduled.as_secs() as i64 > collection_interval as i64 {
log::debug!(
"starting metric collection for remote '{remote}' - triggered because collection is overdue"
);
@@ -558,4 +560,60 @@ pub(super) mod tests {
drop(task);
assert_eq!(handle.await.unwrap(), 2);
}
+
+ #[tokio::test]
+ async fn test_fetch_overdue() {
+ // Arrange
+ test_init();
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let handle = tokio::task::spawn(fake_rrd_task(rx));
+
+ let config = make_remote_config();
+
+ let state_file = NamedTempFile::new(get_create_options()).unwrap();
+ let mut state = MetricCollectionState::new(state_file.path().into(), get_create_options());
+
+ let now = proxmox_time::epoch_i64();
+
+ // This one should be fetched
+ state.set_status(
+ "pve-0-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 35),
+ ..Default::default()
+ },
+ );
+ // This one should *not* be fetched
+ state.set_status(
+ "pve-1-pass".into(),
+ RemoteStatus {
+ last_collection: Some(now - 25),
+ ..Default::default()
+ },
+ );
+
+ let (_control_tx, control_rx) = tokio::sync::mpsc::channel(10);
+
+ let mut task = MetricCollectionTask {
+ state,
+ metric_data_tx: tx,
+ control_message_rx: control_rx,
+ };
+
+ let next_collection = Instant::now() + Duration::from_secs(30);
+
+ // Act
+ task.fetch_overdue(&config, next_collection, 60).await;
+
+ // Assert
+ let status = task.state.get_status("pve-0-pass").unwrap();
+ assert!(status.last_collection.unwrap() - now >= 0);
+
+ let status = task.state.get_status("pve-1-pass").unwrap();
+ assert_eq!(status.last_collection.unwrap(), now - 25);
+
+ drop(task);
+ assert_eq!(handle.await.unwrap(), 1);
+ }
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 09/24] metric collection: pass rrd cache instance as function parameter
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (7 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 08/24] metric collection: add test for fetch_overdue Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 10/24] metric collection: add test for rrd task Lukas Wagner
` (15 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This enables us to do dependency injection for tests by passing in the
rrd cache to use, instead of having to replace the global static
instance. The latter is always awkward: tests may run multi-threaded, so
the global instance would have to be a thread-local, which is weird in
tokio-world, where we actually want task-local variables - and those in
turn are weird once new tasks are spawned, because they don't see their
parent task's task-locals. Long story short, passing in the dependency
as a parameter makes things easier.
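In other words, the helpers go from resolving a global to taking the cache
as an explicit parameter, roughly like this (simplified/trimmed, not the
literal diff):

    // before: the helper resolves the global cache instance internally
    pub fn update_value(name: &str, value: f64, timestamp: i64, datasource_type: DataSourceType) {
        if let Ok(rrd_cache) = get_cache() {
            /* ... */
        }
    }

    // after: the cache is passed in, so tests can inject their own instance
    pub fn update_value(
        rrd_cache: &Cache,
        name: &str,
        value: f64,
        timestamp: i64,
        datasource_type: DataSourceType,
    ) {
        /* ... */
    }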
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/api/rrd_common.rs | 4 +-
server/src/metric_collection/mod.rs | 15 +++-
server/src/metric_collection/rrd_cache.rs | 77 ++++++++++++--------
server/src/metric_collection/rrd_task.rs | 16 ++--
server/src/metric_collection/top_entities.rs | 3 +
5 files changed, 75 insertions(+), 40 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index 0d82d0c3..d9ed017a 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -23,9 +23,11 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let mut timemap = BTreeMap::new();
let mut last_resolution = None;
+ let cache = rrd_cache::get_cache();
+
for name in T::fields() {
let (start, resolution, data) =
- match rrd_cache::extract_data(basedir, name, timeframe, mode)? {
+ match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
Some(data) => data.into(),
None => continue,
};
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 7f67beab..3f0c7f90 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -4,6 +4,8 @@ use std::sync::OnceLock;
use anyhow::{bail, Error};
use tokio::sync::mpsc::{self, Sender};
+use proxmox_sys::fs::CreateOptions;
+
mod collection_task;
pub mod rrd_cache;
mod rrd_task;
@@ -16,7 +18,14 @@ static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
- rrd_cache::init()?;
+ let api_uid = pdm_config::api_user()?.uid;
+ let api_gid = pdm_config::api_group()?.gid;
+
+ let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
+ let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
+
+ let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(cache)?;
Ok(())
}
@@ -42,8 +51,10 @@ pub fn start_task() -> Result<(), Error> {
futures::future::select(metric_collection_task_future, abort_future).await;
});
+ let cache = rrd_cache::get_cache();
+
tokio::spawn(async move {
- let rrd_task_future = pin!(rrd_task::store_in_rrd_task(metric_data_rx));
+ let rrd_task_future = pin!(rrd_task::store_in_rrd_task(cache, metric_data_rx));
let abort_future = pin!(proxmox_daemon::shutdown_future());
futures::future::select(rrd_task_future, abort_future).await;
});
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index a8d72b80..70c91ca6 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -5,6 +5,7 @@
//! and update RRD data inside `proxmox-datacenter-api`.
use std::path::Path;
+use std::sync::Arc;
use anyhow::{format_err, Error};
use once_cell::sync::OnceCell;
@@ -16,32 +17,45 @@ use proxmox_sys::fs::CreateOptions;
use pdm_buildcfg::PDM_STATE_DIR_M;
-const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
+pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
-static RRD_CACHE: OnceCell<Cache> = OnceCell::new();
+// This is an `Arc` because this makes it easier to do dependency injection
+// in test contexts.
+//
+// For DI in testing, we want to pass in a reference to the Cache
+// as a function parameter. In a couple of these functions we
+// spawn tokio tasks which need access to the reference, hence the
+// reference needs to be 'static. In a test context, we kind of have a
+// hard time to come up with a 'static reference, so we just
+// wrap the cache in an `Arc` for now, solving the
+// lifetime problem via refcounting.
+static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
/// Get the RRD cache instance
-fn get_cache() -> Result<&'static Cache, Error> {
+pub fn get_cache() -> Arc<Cache> {
+ RRD_CACHE.get().cloned().expect("rrd cache not initialized")
+}
+
+pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
RRD_CACHE
- .get()
- .ok_or_else(|| format_err!("RRD cache not initialized!"))
+ .set(cache)
+ .map_err(|_| format_err!("RRD cache already initialized!"))?;
+
+ Ok(())
}
/// Initialize the RRD cache instance
///
/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init() -> Result<&'static Cache, Error> {
- let api_uid = pdm_config::api_user()?.uid;
- let api_gid = pdm_config::api_group()?.gid;
-
- let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
- let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
+pub fn init<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+) -> Result<Arc<Cache>, Error> {
let apply_interval = 30.0 * 60.0; // 30 minutes
let cache = Cache::new(
- RRD_CACHE_BASEDIR,
+ base_path,
Some(file_options),
Some(dir_options),
apply_interval,
@@ -51,11 +65,7 @@ pub fn init() -> Result<&'static Cache, Error> {
cache.apply_journal()?;
- RRD_CACHE
- .set(cache)
- .map_err(|_| format_err!("RRD cache already initialized!"))?;
-
- Ok(RRD_CACHE.get().unwrap())
+ Ok(Arc::new(cache))
}
fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
@@ -91,6 +101,7 @@ fn create_callback(dst: DataSourceType) -> Database {
/// Extracts data for the specified time frame from RRD cache
pub fn extract_data(
+ rrd_cache: &Cache,
basedir: &str,
name: &str,
timeframe: RrdTimeframe,
@@ -112,26 +123,28 @@ pub fn extract_data(
RrdMode::Average => AggregationFn::Average,
};
- let rrd_cache = get_cache()?;
-
rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
}
/// Sync/Flush the RRD journal
pub fn sync_journal() {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
- }
+ let rrd_cache = get_cache();
+ if let Err(err) = rrd_cache.sync_journal() {
+ log::error!("rrd_sync_journal failed - {err}");
}
}
+
/// Update RRD Gauge values
-pub fn update_value(name: &str, value: f64, timestamp: i64, datasource_type: DataSourceType) {
- if let Ok(rrd_cache) = get_cache() {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
- }
+pub fn update_value(
+ rrd_cache: &Cache,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+) {
+ if let Err(err) =
+ rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 1c618f54..3704b0e7 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -1,7 +1,9 @@
+use std::sync::Arc;
+
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::rrd::DataSourceType;
+use proxmox_rrd::{rrd::DataSourceType, Cache};
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
@@ -39,9 +41,11 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
+ cache: Arc<Cache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
+ let cache_clone = Arc::clone(&cache);
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
@@ -53,7 +57,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pve(&remote, &data_point);
+ store_metric_pve(&cache_clone, &remote, &data_point);
}
channel
@@ -65,7 +69,7 @@ pub(super) async fn store_in_rrd_task(
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
- store_metric_pbs(&remote, &data_point);
+ store_metric_pbs(&cache_clone, &remote, &data_point);
}
channel
@@ -91,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -105,6 +109,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
@@ -112,7 +117,7 @@ fn store_metric_pve(remote_name: &str, data_point: &ClusterMetricsData) {
);
}
-fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -126,6 +131,7 @@ fn store_metric_pbs(remote_name: &str, data_point: &MetricDataPoint) {
};
rrd_cache::update_value(
+ cache,
&name,
data_point.value,
data_point.timestamp,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f8e053fb..f54ee72f 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -121,7 +121,10 @@ fn get_entity(
name: String,
metric: &str,
) -> Option<(f64, TopEntity)> {
+ let cache = rrd_cache::get_cache();
+
if let Ok(Some(values)) = rrd_cache::extract_data(
+ &cache,
&name,
metric,
timeframe,
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 10/24] metric collection: add test for rrd task
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (8 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 09/24] metric collection: pass rrd cache instance as function parameter Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 11/24] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
` (14 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This commit adds a quick smoke test for the `store_in_rrd_task`
function. Unfortunately, due to a race condition in proxmox_rrd, some
extra measures had to be taken to keep the test from being flaky.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/metric_collection/rrd_task.rs | 92 ++++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 3704b0e7..27327a29 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -138,3 +138,95 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
data_source_type,
);
}
+
+#[cfg(test)]
+mod tests {
+ use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
+ use pve_api_types::{ClusterMetrics, ClusterMetricsData};
+
+ use crate::{
+ metric_collection::collection_task::tests::get_create_options,
+ test_support::temp::NamedTempDir,
+ };
+
+ use super::*;
+
+ #[tokio::test]
+ async fn test_rrd_task_persists_data() -> Result<(), Error> {
+ // Arrange
+ let dir = NamedTempDir::new()?;
+ let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
+ let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+
+ let (tx, rx) = tokio::sync::mpsc::channel(10);
+ let task = store_in_rrd_task(Arc::clone(&cache), rx);
+ let handle = tokio::task::spawn(task);
+
+ let now = proxmox_time::epoch_i64();
+
+ let metrics = ClusterMetrics {
+ data: vec![
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 30,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 20,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now - 10,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.1,
+ },
+ ClusterMetricsData {
+ id: "node/some-node".into(),
+ metric: "cpu_current".into(),
+ timestamp: now,
+ ty: ClusterMetricsDataType::Gauge,
+ value: 0.2,
+ },
+ ],
+ };
+ let (tx_back, rx_back) = tokio::sync::oneshot::channel();
+ let request = RrdStoreRequest::Pve {
+ remote: "some-remote".into(),
+ metrics,
+ channel: tx_back,
+ };
+
+ // Act
+ tx.send(request).await?;
+ let result = rx_back.await?;
+
+ // Assert
+ assert_eq!(result.most_recent_timestamp, now);
+
+ drop(tx);
+ handle.await??;
+
+ // There is some race condition in proxmox_rrd, in some rare cases
+ // extract_data does not return any data directly after writing.
+ if let Some(data) = rrd_cache::extract_data(
+ &cache,
+ "pve/some-remote/node/some-node",
+ "cpu_current",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ // Only assert that there are some data points, the exact position in the vec
+ // might vary due to changed boundaries.
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
+ Ok(())
+ }
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 11/24] metric collection: wrap rrd_cache::Cache in a struct
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (9 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 10/24] metric collection: add test for rrd task Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 12/24] metric collection: record remote response time in metric database Lukas Wagner
` (13 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
All helper functions for storing/retrieving metric data take the cache
instance as their first parameter. This smells like they should be
methods on a struct.
This commit wraps the foreign `proxmox_rrd::Cache` type in a new type,
transforms the `init` function into a `new` method and all other helpers
into further methods.
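After this change, callers use the cache through methods instead of free
functions, along these lines (illustrative usage only, paths and values
made up):

    let cache = RrdCache::new("/path/to/rrdb", dir_options, file_options)?;

    // store a gauge value ...
    cache.update_value("pve/some-remote/node/node1/cpu_current", 0.2, now, DataSourceType::Gauge);

    // ... and read it back for the last hour
    let entry = cache.extract_data(
        "pve/some-remote/node/node1",
        "cpu_current",
        RrdTimeframe::Hour,
        RrdMode::Max,
    )?;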
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/api/rrd_common.rs | 9 +-
server/src/metric_collection/mod.rs | 19 +-
server/src/metric_collection/rrd_cache.rs | 205 +++++++++----------
server/src/metric_collection/rrd_task.rs | 21 +-
server/src/metric_collection/top_entities.rs | 3 +-
5 files changed, 125 insertions(+), 132 deletions(-)
diff --git a/server/src/api/rrd_common.rs b/server/src/api/rrd_common.rs
index d9ed017a..28868bc1 100644
--- a/server/src/api/rrd_common.rs
+++ b/server/src/api/rrd_common.rs
@@ -26,11 +26,10 @@ pub fn create_datapoints_from_rrd<T: DataPoint>(
let cache = rrd_cache::get_cache();
for name in T::fields() {
- let (start, resolution, data) =
- match rrd_cache::extract_data(&cache, basedir, name, timeframe, mode)? {
- Some(data) => data.into(),
- None => continue,
- };
+ let (start, resolution, data) = match cache.extract_data(basedir, name, timeframe, mode)? {
+ Some(data) => data.into(),
+ None => continue,
+ };
if let Some(expected_resolution) = last_resolution {
if resolution != expected_resolution {
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 3f0c7f90..41772a95 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -1,10 +1,12 @@
use std::pin::pin;
+use std::sync::Arc;
use std::sync::OnceLock;
use anyhow::{bail, Error};
+use nix::sys::stat::Mode;
use tokio::sync::mpsc::{self, Sender};
-use proxmox_sys::fs::CreateOptions;
+use pdm_buildcfg::PDM_STATE_DIR_M;
mod collection_task;
pub mod rrd_cache;
@@ -13,19 +15,20 @@ mod state;
pub mod top_entities;
use collection_task::{ControlMsg, MetricCollectionTask};
+use rrd_cache::RrdCache;
+
+const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
static CONTROL_MESSAGE_TX: OnceLock<Sender<ControlMsg>> = OnceLock::new();
/// Initialize the RRD cache
pub fn init() -> Result<(), Error> {
- let api_uid = pdm_config::api_user()?.uid;
- let api_gid = pdm_config::api_group()?.gid;
+ let file_options = proxmox_product_config::default_create_options();
+ let mode = Mode::from_bits_truncate(0o0750);
+ let dir_options = file_options.perm(mode);
- let file_options = CreateOptions::new().owner(api_uid).group(api_gid);
- let dir_options = CreateOptions::new().owner(api_uid).group(api_gid);
-
- let cache = rrd_cache::init(rrd_cache::RRD_CACHE_BASEDIR, dir_options, file_options)?;
- rrd_cache::set_cache(cache)?;
+ let cache = RrdCache::new(RRD_CACHE_BASEDIR, dir_options, file_options)?;
+ rrd_cache::set_cache(Arc::new(cache))?;
Ok(())
}
diff --git a/server/src/metric_collection/rrd_cache.rs b/server/src/metric_collection/rrd_cache.rs
index 70c91ca6..2ca2318c 100644
--- a/server/src/metric_collection/rrd_cache.rs
+++ b/server/src/metric_collection/rrd_cache.rs
@@ -15,10 +15,6 @@ use proxmox_rrd::Cache;
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
use proxmox_sys::fs::CreateOptions;
-use pdm_buildcfg::PDM_STATE_DIR_M;
-
-pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
-
// This is an `Arc` because this makes it easier to do dependency injection
// in test contexts.
//
@@ -29,14 +25,14 @@ pub(super) const RRD_CACHE_BASEDIR: &str = concat!(PDM_STATE_DIR_M!(), "/rrdb");
// hard time to come up with a 'static reference, so we just
// wrap the cache in an `Arc` for now, solving the
// lifetime problem via refcounting.
-static RRD_CACHE: OnceCell<Arc<Cache>> = OnceCell::new();
+static RRD_CACHE: OnceCell<Arc<RrdCache>> = OnceCell::new();
/// Get the RRD cache instance
-pub fn get_cache() -> Arc<Cache> {
+pub fn get_cache() -> Arc<RrdCache> {
RRD_CACHE.get().cloned().expect("rrd cache not initialized")
}
-pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
+pub fn set_cache(cache: Arc<RrdCache>) -> Result<(), Error> {
RRD_CACHE
.set(cache)
.map_err(|_| format_err!("RRD cache already initialized!"))?;
@@ -44,107 +40,106 @@ pub fn set_cache(cache: Arc<Cache>) -> Result<(), Error> {
Ok(())
}
-/// Initialize the RRD cache instance
-///
-/// Note: Only a single process must do this (proxmox-datacenter-api)
-pub fn init<P: AsRef<Path>>(
- base_path: P,
- dir_options: CreateOptions,
- file_options: CreateOptions,
-) -> Result<Arc<Cache>, Error> {
- let apply_interval = 30.0 * 60.0; // 30 minutes
-
- let cache = Cache::new(
- base_path,
- Some(file_options),
- Some(dir_options),
- apply_interval,
- load_callback,
- create_callback,
- )?;
-
- cache.apply_journal()?;
-
- Ok(Arc::new(cache))
+/// Wrapper for proxmox_rrd::Cache to accommodate helper methods.
+pub struct RrdCache {
+ cache: Cache,
}
-fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
- match Database::load(path, true) {
- Ok(rrd) => Some(rrd),
- Err(err) => {
- if err.kind() != std::io::ErrorKind::NotFound {
- log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+impl RrdCache {
+ /// Create a new RrdCache instance
+ pub fn new<P: AsRef<Path>>(
+ base_path: P,
+ dir_options: CreateOptions,
+ file_options: CreateOptions,
+ ) -> Result<Self, Error> {
+ let apply_interval = 30.0 * 60.0; // 30 minutes
+
+ let cache = Cache::new(
+ base_path,
+ Some(file_options),
+ Some(dir_options),
+ apply_interval,
+ Self::load_callback,
+ Self::create_callback,
+ )?;
+
+ cache.apply_journal()?;
+
+ Ok(Self { cache })
+ }
+
+ fn load_callback(path: &Path, _rel_path: &str) -> Option<Database> {
+ match Database::load(path, true) {
+ Ok(rrd) => Some(rrd),
+ Err(err) => {
+ if err.kind() != std::io::ErrorKind::NotFound {
+ log::warn!("overwriting RRD file {path:?}, because of load error: {err}",);
+ }
+ None
}
- None
+ }
+ }
+
+ fn create_callback(dst: DataSourceType) -> Database {
+ let rra_list = vec![
+ // 1 min * 1440 => 1 day
+ Archive::new(AggregationFn::Average, 60, 1440),
+ Archive::new(AggregationFn::Maximum, 60, 1440),
+ // 30 min * 1440 => 30 days ~ 1 month
+ Archive::new(AggregationFn::Average, 30 * 60, 1440),
+ Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
+ // 6 h * 1440 => 360 days ~ 1 year
+ Archive::new(AggregationFn::Average, 6 * 3600, 1440),
+ Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
+ // 1 week * 570 => 10 years
+ Archive::new(AggregationFn::Average, 7 * 86400, 570),
+ Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
+ ];
+
+ Database::new(dst, rra_list)
+ }
+
+ /// Extracts data for the specified time frame from RRD cache
+ pub fn extract_data(
+ &self,
+ basedir: &str,
+ name: &str,
+ timeframe: RrdTimeframe,
+ mode: RrdMode,
+ ) -> Result<Option<proxmox_rrd::Entry>, Error> {
+ let end = proxmox_time::epoch_f64() as u64;
+
+ let (start, resolution) = match timeframe {
+ RrdTimeframe::Hour => (end - 3600, 60),
+ RrdTimeframe::Day => (end - 3600 * 24, 60),
+ RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
+ RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
+ RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
+ RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
+ };
+
+ let cf = match mode {
+ RrdMode::Max => AggregationFn::Maximum,
+ RrdMode::Average => AggregationFn::Average,
+ };
+
+ self.cache
+ .extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
+ }
+
+ /// Update RRD Gauge values
+ pub fn update_value(
+ &self,
+ name: &str,
+ value: f64,
+ timestamp: i64,
+ datasource_type: DataSourceType,
+ ) {
+ if let Err(err) =
+ self.cache
+ .update_value_ignore_old(name, timestamp as f64, value, datasource_type)
+ {
+ log::error!("rrd::update_value '{name}' failed - {err}");
}
}
}
-
-fn create_callback(dst: DataSourceType) -> Database {
- let rra_list = vec![
- // 1 min * 1440 => 1 day
- Archive::new(AggregationFn::Average, 60, 1440),
- Archive::new(AggregationFn::Maximum, 60, 1440),
- // 30 min * 1440 => 30 days ~ 1 month
- Archive::new(AggregationFn::Average, 30 * 60, 1440),
- Archive::new(AggregationFn::Maximum, 30 * 60, 1440),
- // 6 h * 1440 => 360 days ~ 1 year
- Archive::new(AggregationFn::Average, 6 * 3600, 1440),
- Archive::new(AggregationFn::Maximum, 6 * 3600, 1440),
- // 1 week * 570 => 10 years
- Archive::new(AggregationFn::Average, 7 * 86400, 570),
- Archive::new(AggregationFn::Maximum, 7 * 86400, 570),
- ];
-
- Database::new(dst, rra_list)
-}
-
-/// Extracts data for the specified time frame from RRD cache
-pub fn extract_data(
- rrd_cache: &Cache,
- basedir: &str,
- name: &str,
- timeframe: RrdTimeframe,
- mode: RrdMode,
-) -> Result<Option<proxmox_rrd::Entry>, Error> {
- let end = proxmox_time::epoch_f64() as u64;
-
- let (start, resolution) = match timeframe {
- RrdTimeframe::Hour => (end - 3600, 60),
- RrdTimeframe::Day => (end - 3600 * 24, 60),
- RrdTimeframe::Week => (end - 3600 * 24 * 7, 30 * 60),
- RrdTimeframe::Month => (end - 3600 * 24 * 30, 30 * 60),
- RrdTimeframe::Year => (end - 3600 * 24 * 365, 6 * 60 * 60),
- RrdTimeframe::Decade => (end - 10 * 3600 * 24 * 366, 7 * 86400),
- };
-
- let cf = match mode {
- RrdMode::Max => AggregationFn::Maximum,
- RrdMode::Average => AggregationFn::Average,
- };
-
- rrd_cache.extract_cached_data(basedir, name, cf, resolution, Some(start), Some(end))
-}
-
-/// Sync/Flush the RRD journal
-pub fn sync_journal() {
- let rrd_cache = get_cache();
- if let Err(err) = rrd_cache.sync_journal() {
- log::error!("rrd_sync_journal failed - {err}");
- }
-}
-
-/// Update RRD Gauge values
-pub fn update_value(
- rrd_cache: &Cache,
- name: &str,
- value: f64,
- timestamp: i64,
- datasource_type: DataSourceType,
-) {
- if let Err(err) =
- rrd_cache.update_value_ignore_old(name, timestamp as f64, value, datasource_type)
- {
- log::error!("rrd::update_value '{name}' failed - {err}");
- }
-}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 27327a29..f65a9291 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -3,12 +3,12 @@ use std::sync::Arc;
use anyhow::Error;
use tokio::sync::{mpsc::Receiver, oneshot};
-use proxmox_rrd::{rrd::DataSourceType, Cache};
+use proxmox_rrd::rrd::DataSourceType;
use pbs_api_types::{MetricDataPoint, MetricDataType, Metrics};
use pve_api_types::{ClusterMetrics, ClusterMetricsData, ClusterMetricsDataType};
-use super::rrd_cache;
+use super::rrd_cache::RrdCache;
/// Store request for the RRD task.
pub(super) enum RrdStoreRequest {
@@ -41,7 +41,7 @@ pub(super) struct RrdStoreResult {
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
- cache: Arc<Cache>,
+ cache: Arc<RrdCache>,
mut receiver: Receiver<RrdStoreRequest>,
) -> Result<(), Error> {
while let Some(msg) = receiver.recv().await {
@@ -95,7 +95,7 @@ pub(super) async fn store_in_rrd_task(
Ok(())
}
-fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetricsData) {
+fn store_metric_pve(cache: &RrdCache, remote_name: &str, data_point: &ClusterMetricsData) {
let name = format!(
"pve/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -108,8 +108,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
ClusterMetricsDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -117,7 +116,7 @@ fn store_metric_pve(cache: &Cache, remote_name: &str, data_point: &ClusterMetric
);
}
-fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoint) {
+fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricDataPoint) {
let name = format!(
"pbs/{remote_name}/{id}/{metric}",
id = data_point.id,
@@ -130,8 +129,7 @@ fn store_metric_pbs(cache: &Cache, remote_name: &str, data_point: &MetricDataPoi
MetricDataType::Derive => DataSourceType::Derive,
};
- rrd_cache::update_value(
- cache,
+ cache.update_value(
&name,
data_point.value,
data_point.timestamp,
@@ -156,7 +154,7 @@ mod tests {
// Arrange
let dir = NamedTempDir::new()?;
let options = get_create_options().perm(nix::sys::stat::Mode::from_bits_truncate(0o700));
- let cache = rrd_cache::init(&dir.path(), options.clone(), options.clone())?;
+ let cache = Arc::new(RrdCache::new(dir.path(), options, options)?);
let (tx, rx) = tokio::sync::mpsc::channel(10);
let task = store_in_rrd_task(Arc::clone(&cache), rx);
@@ -215,8 +213,7 @@ mod tests {
// There is some race condition in proxmox_rrd, in some rare cases
// extract_data does not return any data directly after writing.
- if let Some(data) = rrd_cache::extract_data(
- &cache,
+ if let Some(data) = cache.extract_data(
"pve/some-remote/node/some-node",
"cpu_current",
RrdTimeframe::Hour,
diff --git a/server/src/metric_collection/top_entities.rs b/server/src/metric_collection/top_entities.rs
index f54ee72f..31e36c34 100644
--- a/server/src/metric_collection/top_entities.rs
+++ b/server/src/metric_collection/top_entities.rs
@@ -123,8 +123,7 @@ fn get_entity(
) -> Option<(f64, TopEntity)> {
let cache = rrd_cache::get_cache();
- if let Ok(Some(values)) = rrd_cache::extract_data(
- &cache,
+ if let Ok(Some(values)) = cache.extract_data(
&name,
metric,
timeframe,
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 12/24] metric collection: record remote response time in metric database
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (10 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 11/24] metric collection: wrap rrd_cache::Cache in a struct Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 13/24] metric collection: save time needed for collection run to RRD Lukas Wagner
` (12 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This gives us the ability to retrieve max/avg response times for a given
time window.
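The response time is simply the elapsed wall-clock time of the metrics API
request, measured with std::time::Instant and shipped to the RRD task
together with the metrics (simplified sketch of the change below):

    let start = Instant::now();
    let metrics = /* existing metrics API request against the remote */;

    sender
        .send(RrdStoreRequest::Pve {
            remote: remote.id.clone(),
            metrics,
            channel: result_tx,
            // TODO: use as_millis_f64 once stabilized
            response_time: start.elapsed().as_secs_f64() * 1000.,
            request_at: now,
        })
        .await?;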
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 13 +++++++-
server/src/metric_collection/rrd_task.rs | 31 +++++++++++++++++++
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index c3448995..51bcd78a 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -266,6 +266,7 @@ impl MetricCollectionTask {
let (result_tx, result_rx) = oneshot::channel();
let now = proxmox_time::epoch_i64();
+ let start = Instant::now();
let res: Result<RrdStoreResult, Error> = async {
match remote.ty {
@@ -279,11 +280,16 @@ impl MetricCollectionTask {
)
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pve {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
@@ -293,15 +299,20 @@ impl MetricCollectionTask {
.metrics(Some(true), Some(status.most_recent_datapoint))
.await?;
+ let duration = start.elapsed();
+
sender
.send(RrdStoreRequest::Pbs {
remote: remote.id.clone(),
metrics,
channel: result_tx,
+ // TODO: use as_millis_f64 once stabilized
+ response_time: duration.as_secs_f64() * 1000.,
+ request_at: now,
})
.await?;
}
- }
+ };
result_rx.await.map_err(Error::from)
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index f65a9291..91927602 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -20,6 +20,10 @@ pub(super) enum RrdStoreRequest {
metrics: ClusterMetrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
/// Store PBS metrics.
Pbs {
@@ -29,6 +33,10 @@ pub(super) enum RrdStoreRequest {
metrics: Metrics,
/// Oneshot channel to return the [`RrdStoreResult`].
channel: oneshot::Sender<RrdStoreResult>,
+ /// Response time in ms for the API request.
+ response_time: f64,
+ /// Timestamp at which the request was done (UNIX epoch).
+ request_at: i64,
},
}
@@ -54,11 +62,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pve(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -66,11 +77,14 @@ pub(super) async fn store_in_rrd_task(
remote,
metrics,
channel,
+ response_time,
+ request_at,
} => {
for data_point in metrics.data {
most_recent_timestamp = most_recent_timestamp.max(data_point.timestamp);
store_metric_pbs(&cache_clone, &remote, &data_point);
}
+ store_response_time(&cache_clone, &remote, response_time, request_at);
channel
}
@@ -137,6 +151,12 @@ fn store_metric_pbs(cache: &RrdCache, remote_name: &str, data_point: &MetricData
);
}
+fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64, timestamp: i64) {
+ let name = format!("remotes/{remote_name}/metric-collection-response-time");
+
+ cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
@@ -199,6 +219,8 @@ mod tests {
remote: "some-remote".into(),
metrics,
channel: tx_back,
+ response_time: 10.0,
+ request_at: now,
};
// Act
@@ -224,6 +246,15 @@ mod tests {
assert!(data.data.iter().any(Option::is_some));
}
+ if let Some(data) = cache.extract_data(
+ "remotes/some-remote",
+ "metric-collection-response-time",
+ RrdTimeframe::Hour,
+ RrdMode::Max,
+ )? {
+ assert!(data.data.iter().any(Option::is_some));
+ }
+
Ok(())
}
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 13/24] metric collection: save time needed for collection run to RRD
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (11 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 12/24] metric collection: record remote response time in metric database Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 14/24] metric collection: periodically clean removed remotes from statefile Lukas Wagner
` (11 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
For large setups, it might be useful to know how much time was needed to
collect metrics for *all* remotes together, e.g. for making sure that
the collection interval is not exceeded.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 17 ++++++
server/src/metric_collection/rrd_task.rs | 53 ++++++++++++++-----
2 files changed, 58 insertions(+), 12 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 51bcd78a..7605198f 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -17,6 +17,7 @@ use proxmox_sys::fs::CreateOptions;
use pdm_api_types::remotes::{Remote, RemoteType};
+use crate::metric_collection::rrd_task::CollectionStats;
use crate::{connection, task_utils};
use super::{
@@ -105,11 +106,27 @@ impl MetricCollectionTask {
log::debug!("starting metric collection from all remotes - triggered by timer");
if let Some(remotes) = Self::load_remote_config() {
+ let now = Instant::now();
let to_fetch = remotes
.iter()
.map(|(name, _)| name.into())
.collect::<Vec<String>>();
self.fetch_remotes(&remotes, &to_fetch).await;
+ let elapsed = now.elapsed();
+
+ if let Err(err) = self
+ .metric_data_tx
+ .send(RrdStoreRequest::CollectionStats {
+ timestamp: proxmox_time::epoch_i64(),
+ stats: CollectionStats {
+ // TODO: use as_millis_f64 once stabilized
+ total_time: elapsed.as_secs_f64() * 1000.,
+ },
+ })
+ .await
+ {
+ log::error!("could not send collection stats to rrd task: {err}");
+ }
}
}
diff --git a/server/src/metric_collection/rrd_task.rs b/server/src/metric_collection/rrd_task.rs
index 91927602..507d6b23 100644
--- a/server/src/metric_collection/rrd_task.rs
+++ b/server/src/metric_collection/rrd_task.rs
@@ -38,6 +38,13 @@ pub(super) enum RrdStoreRequest {
/// Timestamp at which the request was done (UNIX epoch).
request_at: i64,
},
+ /// Store collection stats.
+ CollectionStats {
+ /// Timestamp at which the collection took place (UNIX epoch).
+ timestamp: i64,
+ /// Statistics.
+ stats: CollectionStats,
+ },
}
/// Result for a [`RrdStoreRequest`].
@@ -46,6 +53,12 @@ pub(super) struct RrdStoreResult {
pub(super) most_recent_timestamp: i64,
}
+/// Statistics for a (full) metric collection run.
+pub(super) struct CollectionStats {
+ /// Total time in ms.
+ pub(super) total_time: f64,
+}
+
/// Task which stores received metrics in the RRD. Metric data is fed into
/// this task via a MPSC channel.
pub(super) async fn store_in_rrd_task(
@@ -57,7 +70,8 @@ pub(super) async fn store_in_rrd_task(
// Involves some blocking file IO
let res = tokio::task::spawn_blocking(move || {
let mut most_recent_timestamp = 0;
- let channel = match msg {
+
+ match msg {
RrdStoreRequest::Pve {
remote,
metrics,
@@ -71,7 +85,13 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
+
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
}
RrdStoreRequest::Pbs {
remote,
@@ -86,17 +106,17 @@ pub(super) async fn store_in_rrd_task(
}
store_response_time(&cache_clone, &remote, response_time, request_at);
- channel
- }
- };
+ let result = RrdStoreResult {
+ most_recent_timestamp,
+ };
- if channel
- .send(RrdStoreResult {
- most_recent_timestamp,
- })
- .is_err()
- {
- log::error!("could not send RrdStoreStoreResult to metric collection task");
+ if channel.send(result).is_err() {
+ log::error!("could not send RrdStoreStoreResult to metric collection task");
+ };
+ }
+ RrdStoreRequest::CollectionStats { timestamp, stats } => {
+ store_stats(&cache_clone, &stats, timestamp)
+ }
};
})
.await;
@@ -157,6 +177,15 @@ fn store_response_time(cache: &RrdCache, remote_name: &str, response_time: f64,
cache.update_value(&name, response_time, timestamp, DataSourceType::Gauge);
}
+fn store_stats(cache: &RrdCache, stats: &CollectionStats, timestamp: i64) {
+ cache.update_value(
+ "nodes/localhost/metric-collection-total-time",
+ stats.total_time,
+ timestamp,
+ DataSourceType::Gauge,
+ );
+}
+
#[cfg(test)]
mod tests {
use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 14/24] metric collection: periodically clean removed remotes from statefile
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (12 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 13/24] metric collection: save time needed for collection run to RRD Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 15/24] api: add endpoint to trigger metric collection Lukas Wagner
` (10 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
Adding and removing remotes can leave leftover data in the statefile,
hence it makes sense to clean it up periodically.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
.../src/metric_collection/collection_task.rs | 6 ++++
server/src/metric_collection/state.rs | 32 +++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 7605198f..e06c919d 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -106,6 +106,8 @@ impl MetricCollectionTask {
log::debug!("starting metric collection from all remotes - triggered by timer");
if let Some(remotes) = Self::load_remote_config() {
+ self.cleanup_removed_remotes_from_state(&remotes);
+
let now = Instant::now();
let to_fetch = remotes
.iter()
@@ -150,6 +152,10 @@ impl MetricCollectionTask {
}
}
+ fn cleanup_removed_remotes_from_state(&mut self, remotes: &SectionConfigData<Remote>) {
+ self.state.retain(|remote| remotes.get(remote).is_some());
+ }
+
/// Set up a [`tokio::time::Interval`] instance with the provided interval.
/// The timer will be aligned, e.g. an interval of `60` will let the timer
/// fire at minute boundaries.
diff --git a/server/src/metric_collection/state.rs b/server/src/metric_collection/state.rs
index e76444a9..7f68843e 100644
--- a/server/src/metric_collection/state.rs
+++ b/server/src/metric_collection/state.rs
@@ -74,6 +74,11 @@ impl MetricCollectionState {
Ok(())
}
+ /// Retain all remotes for which the predicate holds.
+ pub fn retain(&mut self, check: impl Fn(&str) -> bool) {
+ self.state.remote_status.retain(|remote, _| check(remote));
+ }
+
fn load_or_default(path: &Path) -> Result<State, Error> {
let content = proxmox_sys::fs::file_read_optional_string(path)?;
@@ -115,4 +120,31 @@ mod tests {
Ok(())
}
+
+ #[test]
+ fn test_retain() -> Result<(), Error> {
+ let file = NamedTempFile::new(get_create_options())?;
+ let options = get_create_options();
+ let mut state = MetricCollectionState::new(file.path().into(), options);
+
+ state.set_status(
+ "remote-1".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+ state.set_status(
+ "remote-2".into(),
+ RemoteStatus {
+ ..Default::default()
+ },
+ );
+
+ state.retain(|remote| remote == "remote-1");
+
+ assert!(state.get_status("remote-1").is_some());
+ assert!(state.get_status("remote-2").is_none());
+
+ Ok(())
+ }
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 15/24] api: add endpoint to trigger metric collection
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (13 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 14/24] metric collection: periodically clean removed remotes from statefile Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 16/24] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
` (9 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This one lives under /metric-collection/trigger.
Doing a POST will trigger metric collection for a remote provided in the
'remote' parameter, or all remotes if the parameter is not provided.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/api/metric_collection.rs | 27 ++++++---------------------
server/src/api/mod.rs | 2 ++
2 files changed, 8 insertions(+), 21 deletions(-)
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 0658fb1f..308ceca2 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -1,26 +1,17 @@
use anyhow::Error;
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
use proxmox_router::{Router, SubdirMap};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
-use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, MetricCollectionStatus};
-
-use crate::metric_collection;
-
pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
#[sortable]
-const SUBDIRS: SubdirMap = &sorted!([
- (
- "trigger",
- &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
- ),
- (
- "status",
- &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
- ),
-]);
+const SUBDIRS: SubdirMap = &sorted!([(
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+),]);
#[api(
input: {
@@ -34,13 +25,7 @@ const SUBDIRS: SubdirMap = &sorted!([
)]
/// Trigger metric collection for a provided remote or for all remotes if no remote is passed.
pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
- crate::metric_collection::trigger_metric_collection(remote, false).await?;
+ crate::metric_collection::trigger_metric_collection(remote).await?;
Ok(())
}
-
-#[api]
-/// Read metric collection status.
-fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
- metric_collection::get_status()
-}
diff --git a/server/src/api/mod.rs b/server/src/api/mod.rs
index 6c4831b4..ff875fc8 100644
--- a/server/src/api/mod.rs
+++ b/server/src/api/mod.rs
@@ -9,6 +9,7 @@ use proxmox_sortable_macro::sortable;
pub mod access;
pub mod config;
+pub mod metric_collection;
pub mod nodes;
pub mod pbs;
pub mod pve;
@@ -21,6 +22,7 @@ mod rrd_common;
const SUBDIRS: SubdirMap = &sorted!([
("access", &access::ROUTER),
("config", &config::ROUTER),
+ ("metric-collection", &metric_collection::ROUTER),
("ping", &Router::new().get(&API_METHOD_PING)),
("pve", &pve::ROUTER),
("pbs", &pbs::ROUTER),
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 16/24] api: remotes: trigger immediate metric collection for newly added nodes
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (14 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 15/24] api: add endpoint to trigger metric collection Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 17/24] api: add api for querying metric collection RRD data Lukas Wagner
` (8 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This makes sure that metric graphs are available right after adding a
new remote.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
server/src/api/remotes.rs | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index 8ff1ada0..b49228a9 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -18,6 +18,7 @@ use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
+use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
@@ -171,10 +172,15 @@ pub async fn add_remote(mut entry: Remote, create_token: Option<String>) -> Resu
entry.token = token;
}
+ let name = entry.id.clone();
remotes.insert(entry.id.to_owned(), entry);
pdm_config::remotes::save_config(&remotes)?;
+ if let Err(e) = metric_collection::trigger_metric_collection(Some(name)).await {
+ log::error!("could not trigger metric collection after adding remote: {e}");
+ }
+
Ok(())
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 17/24] api: add api for querying metric collection RRD data
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (15 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 16/24] api: remotes: trigger immediate metric collection for newly added nodes Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 18/24] api: metric-collection: add status endpoint Lukas Wagner
` (7 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This commit adds two new API endpoints:
- remotes/{id}/rrddata
- nodes/localhost/rrddata
The first one returns graphable datapoints for the API response time
when fetching metrics, the second one returns datapoints about the total
time needed for collecting *all* remotes.
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
lib/pdm-api-types/src/rrddata.rs | 26 ++++++++++++++
server/src/api/metric_collection.rs | 3 +-
server/src/api/nodes/mod.rs | 2 ++
server/src/api/nodes/rrddata.rs | 48 ++++++++++++++++++++++++++
server/src/api/remotes.rs | 53 +++++++++++++++++++++++++++++
5 files changed, 131 insertions(+), 1 deletion(-)
create mode 100644 server/src/api/nodes/rrddata.rs
diff --git a/lib/pdm-api-types/src/rrddata.rs b/lib/pdm-api-types/src/rrddata.rs
index a973977c..47847b72 100644
--- a/lib/pdm-api-types/src/rrddata.rs
+++ b/lib/pdm-api-types/src/rrddata.rs
@@ -216,3 +216,29 @@ pub struct PbsDatastoreDataPoint {
#[serde(skip_serializing_if = "Option::is_none")]
pub disk_write: Option<f64>,
}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for statistics about the metric collection loop.
+pub struct PdmNodeDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// Total time in milliseconds needed for full metric collection run.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub total_time: Option<f64>,
+}
+
+#[api]
+#[derive(Serialize, Deserialize, Default)]
+#[serde(rename_all = "kebab-case")]
+/// RRD datapoint for a single remote.
+pub struct RemoteDatapoint {
+ /// Timestamp (UNIX epoch)
+ pub time: u64,
+
+ /// API response time in milliseconds when requesting the metrics from the remote.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub metric_collection_response_time: Option<f64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 308ceca2..02e5be22 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -1,10 +1,11 @@
use anyhow::Error;
-use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
use proxmox_router::{Router, SubdirMap};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+
pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
#[sortable]
diff --git a/server/src/api/nodes/mod.rs b/server/src/api/nodes/mod.rs
index 42da798a..6f30ba7e 100644
--- a/server/src/api/nodes/mod.rs
+++ b/server/src/api/nodes/mod.rs
@@ -9,6 +9,7 @@ pub mod config;
pub mod dns;
pub mod journal;
pub mod network;
+pub mod rrddata;
pub mod syslog;
pub mod tasks;
pub mod termproxy;
@@ -43,6 +44,7 @@ pub const SUBDIRS: SubdirMap = &sorted!([
("dns", &dns::ROUTER),
("journal", &journal::ROUTER),
("network", &network::ROUTER),
+ ("rrdata", &rrddata::ROUTER),
("syslog", &syslog::ROUTER),
("tasks", &tasks::ROUTER),
("termproxy", &termproxy::ROUTER),
diff --git a/server/src/api/nodes/rrddata.rs b/server/src/api/nodes/rrddata.rs
new file mode 100644
index 00000000..de7f5a52
--- /dev/null
+++ b/server/src/api/nodes/rrddata.rs
@@ -0,0 +1,48 @@
+use anyhow::Error;
+use proxmox_rrd_api_types::{RrdMode, RrdTimeframe};
+
+use proxmox_router::Router;
+use proxmox_schema::api;
+
+use pdm_api_types::rrddata::PdmNodeDatapoint;
+
+use crate::api::rrd_common::{self, DataPoint};
+
+impl DataPoint for PdmNodeDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["metric-collection-total-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "metric-collection-total-time" {
+ self.total_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read RRD data for this PDM node.
+fn get_node_rrddata(timeframe: RrdTimeframe, cf: RrdMode) -> Result<Vec<PdmNodeDatapoint>, Error> {
+ let base = "nodes/localhost";
+ rrd_common::create_datapoints_from_rrd(base, timeframe, cf)
+}
+
+pub const ROUTER: Router = Router::new().get(&API_METHOD_GET_NODE_RRDDATA);
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index b49228a9..c2489e60 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -9,6 +9,8 @@ use proxmox_access_control::CachedUserInfo;
use proxmox_router::{
http_bail, http_err, list_subdirs_api_method, Permission, Router, RpcEnvironment, SubdirMap,
};
+use proxmox_rrd_api_types::RrdMode;
+use proxmox_rrd_api_types::RrdTimeframe;
use proxmox_schema::api;
use proxmox_schema::Schema;
use proxmox_section_config::typed::SectionConfigData;
@@ -16,12 +18,15 @@ use proxmox_sortable_macro::sortable;
use proxmox_time::{epoch_i64, epoch_to_rfc2822};
use pdm_api_types::remotes::{Remote, RemoteType, RemoteUpdater, REMOTE_ID_SCHEMA};
+use pdm_api_types::rrddata::RemoteDatapoint;
use pdm_api_types::{Authid, ConfigDigest, PRIV_RESOURCE_AUDIT, PRIV_RESOURCE_MODIFY};
use crate::metric_collection;
use crate::{connection, pbs_client};
use super::pve;
+use super::rrd_common;
+use super::rrd_common::DataPoint;
pub const ROUTER: Router = Router::new()
.get(&API_METHOD_LIST_REMOTES)
@@ -38,6 +43,10 @@ const ITEM_ROUTER: Router = Router::new()
const SUBDIRS: SubdirMap = &sorted!([
("config", &Router::new().get(&API_METHOD_REMOTE_CONFIG)),
("version", &Router::new().get(&API_METHOD_VERSION)),
+ (
+ "rrddata",
+ &Router::new().get(&API_METHOD_GET_PER_REMOTE_RRD_DATA)
+ ),
]);
pub fn get_remote<'a>(
@@ -331,3 +340,47 @@ pub fn remote_config(id: String) -> Result<Remote, Error> {
remote.token = String::new(); // mask token in response
Ok(remote.clone())
}
+
+impl DataPoint for RemoteDatapoint {
+ fn new(time: u64) -> Self {
+ Self {
+ time,
+ ..Default::default()
+ }
+ }
+
+ fn fields() -> &'static [&'static str] {
+ &["metric-collection-response-time"]
+ }
+
+ fn set_field(&mut self, name: &str, value: f64) {
+ if name == "metric-collection-response-time" {
+ self.metric_collection_response_time = Some(value);
+ }
+ }
+}
+
+#[api(
+ input: {
+ properties: {
+ id: {
+ schema: REMOTE_ID_SCHEMA,
+ },
+ timeframe: {
+ type: RrdTimeframe,
+ },
+ cf: {
+ type: RrdMode,
+ },
+ },
+ },
+)]
+/// Read metric collection RRD data.
+fn get_per_remote_rrd_data(
+ id: String,
+ timeframe: RrdTimeframe,
+ cf: RrdMode,
+) -> Result<Vec<RemoteDatapoint>, Error> {
+ let base = format!("remotes/{id}");
+ rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 18/24] api: metric-collection: add status endpoint
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (16 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 17/24] api: add api for querying metric collection RRD data Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 19/24] pdm-client: add metric collection API methods Lukas Wagner
` (6 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This adds a new API endpoint at /metric-collection/status, returning the
status of the last metric collection attempt from each remote.
For now, this contains the timestamp (last-collection) and any
error that occurred (error).
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
Changes since v2:
- use `_` for type casts
- Drop max-concurrency, and delay settings for now
lib/pdm-api-types/src/lib.rs | 3 +++
lib/pdm-api-types/src/metric_collection.rs | 20 ++++++++++++++++++
server/src/api/metric_collection.rs | 24 +++++++++++++++++-----
server/src/metric_collection/mod.rs | 21 +++++++++++++++++++
4 files changed, 63 insertions(+), 5 deletions(-)
create mode 100644 lib/pdm-api-types/src/metric_collection.rs
diff --git a/lib/pdm-api-types/src/lib.rs b/lib/pdm-api-types/src/lib.rs
index 37da134c..8dbacba9 100644
--- a/lib/pdm-api-types/src/lib.rs
+++ b/lib/pdm-api-types/src/lib.rs
@@ -19,6 +19,9 @@ pub use acl::*;
mod node_config;
pub use node_config::*;
+mod metric_collection;
+pub use metric_collection::*;
+
mod proxy;
pub use proxy::HTTP_PROXY_SCHEMA;
diff --git a/lib/pdm-api-types/src/metric_collection.rs b/lib/pdm-api-types/src/metric_collection.rs
new file mode 100644
index 00000000..5279c8a4
--- /dev/null
+++ b/lib/pdm-api-types/src/metric_collection.rs
@@ -0,0 +1,20 @@
+//! API types for metric collection.
+
+use serde::{Deserialize, Serialize};
+
+use proxmox_schema::api;
+
+#[api]
+#[derive(Clone, Deserialize, Serialize)]
+#[serde(rename_all = "kebab-case")]
+/// Per-remote collection status.
+pub struct MetricCollectionStatus {
+ /// The remote's name.
+ pub remote: String,
+ /// Any error that occurred during the last collection attempt.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub error: Option<String>,
+ /// Timestamp of last successful collection.
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub last_collection: Option<i64>,
+}
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 02e5be22..845cc0e6 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -4,15 +4,23 @@ use proxmox_router::{Router, SubdirMap};
use proxmox_schema::api;
use proxmox_sortable_macro::sortable;
-use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use pdm_api_types::{remotes::REMOTE_ID_SCHEMA, MetricCollectionStatus};
+
+use crate::metric_collection;
pub const ROUTER: Router = Router::new().subdirs(SUBDIRS);
#[sortable]
-const SUBDIRS: SubdirMap = &sorted!([(
- "trigger",
- &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
-),]);
+const SUBDIRS: SubdirMap = &sorted!([
+ (
+ "trigger",
+ &Router::new().post(&API_METHOD_TRIGGER_METRIC_COLLECTION)
+ ),
+ (
+ "status",
+ &Router::new().get(&API_METHOD_GET_METRIC_COLLECTION_STATUS)
+ ),
+]);
#[api(
input: {
@@ -30,3 +38,9 @@ pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Err
Ok(())
}
+
+#[api]
+/// Read metric collection status.
+fn get_metric_collection_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ metric_collection::get_status()
+}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 41772a95..2ddffda5 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -6,6 +6,7 @@ use anyhow::{bail, Error};
use nix::sys::stat::Mode;
use tokio::sync::mpsc::{self, Sender};
+use pdm_api_types::MetricCollectionStatus;
use pdm_buildcfg::PDM_STATE_DIR_M;
mod collection_task;
@@ -78,3 +79,23 @@ pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Err
Ok(())
}
+
+/// Get each remote's metric collection status.
+pub fn get_status() -> Result<Vec<MetricCollectionStatus>, Error> {
+ let (remotes, _) = pdm_config::remotes::config()?;
+ let state = collection_task::load_state()?;
+
+ let mut result = Vec::new();
+
+ for (remote, _) in remotes.into_iter() {
+ if let Some(status) = state.get_status(&remote) {
+ result.push(MetricCollectionStatus {
+ remote,
+ error: status.error.clone(),
+ last_collection: status.last_collection,
+ })
+ }
+ }
+
+ Ok(result)
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 19/24] pdm-client: add metric collection API methods
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (17 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 18/24] api: metric-collection: add status endpoint Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 20/24] cli: add commands for metric-collection trigger and status Lukas Wagner
` (5 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This adds bindings for the metric collection trigger, status, and RRD data APIs.
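For illustration, a minimal usage sketch of the new bindings might look like
the following. The remote name is hypothetical, `client` is assumed to be an
already set-up and authenticated PdmClient instance, and import paths/trait
bounds are assumptions based on the code added below; error handling is kept
deliberately simple.

    use pdm_client::PdmClient;
    use proxmox_client::HttpApiClient;

    // Sketch only: construction of `client` is not shown in this series.
    async fn print_collection_status<C: HttpApiClient>(client: &PdmClient<C>) {
        // Trigger a collection run for a single (hypothetical) remote.
        if let Err(err) = client.trigger_metric_collection(Some("pve-main")).await {
            eprintln!("trigger failed: {err}");
        }

        // Afterwards, print the per-remote collection status.
        match client.get_metric_collection_status().await {
            Ok(statuses) => {
                for s in statuses {
                    println!(
                        "{}: error={:?}, last-collection={:?}",
                        s.remote, s.error, s.last_collection
                    );
                }
            }
            Err(err) => eprintln!("status query failed: {err}"),
        }
    }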
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
lib/pdm-client/src/lib.rs | 56 +++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/lib/pdm-client/src/lib.rs b/lib/pdm-client/src/lib.rs
index 767ac873..f2bb546a 100644
--- a/lib/pdm-client/src/lib.rs
+++ b/lib/pdm-client/src/lib.rs
@@ -318,6 +318,62 @@ impl<T: HttpApiClient> PdmClient<T> {
.ok_or_else(|| Error::BadApi("api returned no webauthn entry id".to_string(), None))
}
+ /// Trigger metric collection for a single remote or for all remotes, if no remote is provided.
+ pub async fn trigger_metric_collection(
+ &self,
+ remote: Option<&str>,
+ ) -> Result<(), proxmox_client::Error> {
+ let path = "/api2/extjs/metric-collection/trigger";
+
+ #[derive(Serialize)]
+ struct TriggerParams<'a> {
+ #[serde(skip_serializing_if = "Option::is_none")]
+ remote: Option<&'a str>,
+ }
+
+ self.0
+ .post(path, &TriggerParams { remote })
+ .await?
+ .nodata()?;
+
+ Ok(())
+ }
+
+ /// Get global metric collection status.
+ pub async fn get_metric_collection_status(
+ &self,
+ ) -> Result<Vec<pdm_api_types::MetricCollectionStatus>, Error> {
+ let path = "/api2/extjs/metric-collection/status";
+ Ok(self.0.get(path).await?.expect_json()?.data)
+ }
+
+ /// Get PDM node RRD data.
+ pub async fn get_pdm_node_rrddata(
+ &self,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<Vec<pdm_api_types::rrddata::PdmNodeDatapoint>, Error> {
+ let path = ApiPathBuilder::new("/api2/extjs/nodes/localhost/rrddata")
+ .arg("cf", mode)
+ .arg("timeframe", timeframe)
+ .build();
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
+ /// Get per-remote RRD data.
+ pub async fn get_per_remote_rrddata(
+ &self,
+ remote: &str,
+ mode: RrdMode,
+ timeframe: RrdTimeframe,
+ ) -> Result<Vec<pdm_api_types::rrddata::RemoteDatapoint>, Error> {
+ let path = ApiPathBuilder::new(format!("/api2/extjs/remotes/{remote}/rrddata"))
+ .arg("cf", mode)
+ .arg("timeframe", timeframe)
+ .build();
+ Ok(self.0.get(&path).await?.expect_json()?.data)
+ }
+
pub async fn pve_list_nodes(
&self,
remote: &str,
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 20/24] cli: add commands for metric-collection trigger and status
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (18 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 19/24] pdm-client: add metric collection API methods Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 21/24] metric collection: skip missed timer ticks Lukas Wagner
` (4 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This adds new commands to the proxmox-datacenter-client CLI tool, namely
- trigger metric collection
- show status of the last metric collection
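For illustration, typical invocations could look like this (the remote name is
just a placeholder):

    proxmox-datacenter-client metric-collection trigger --remote pve-main
    proxmox-datacenter-client metric-collection status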
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
Changes since v1:
- update-settings -> 'settings update'
- show-settings -> 'settings show'
New in v2.
cli/client/Cargo.toml | 1 +
cli/client/src/main.rs | 2 +
cli/client/src/metric_collection.rs | 70 +++++++++++++++++++++++++++++
3 files changed, 73 insertions(+)
create mode 100644 cli/client/src/metric_collection.rs
diff --git a/cli/client/Cargo.toml b/cli/client/Cargo.toml
index 3de3bf3f..c2aac874 100644
--- a/cli/client/Cargo.toml
+++ b/cli/client/Cargo.toml
@@ -46,6 +46,7 @@ proxmox-rrd-api-types.workspace = true
proxmox-schema = { workspace = true, features = [ "api-macro" ] }
proxmox-section-config.workspace = true
proxmox-sys.workspace = true
+proxmox-time.workspace = true
proxmox-tfa = { workspace = true, features = [ "types" ] }
# for the pve API types:
diff --git a/cli/client/src/main.rs b/cli/client/src/main.rs
index b1cc2ad5..27fd3744 100644
--- a/cli/client/src/main.rs
+++ b/cli/client/src/main.rs
@@ -17,6 +17,7 @@ pub mod env;
pub mod acl;
pub mod config;
+pub mod metric_collection;
pub mod pbs;
pub mod pve;
pub mod remotes;
@@ -96,6 +97,7 @@ fn main_do() -> Result<(), Error> {
)
.insert("acl", acl::cli())
.insert("login", CliCommand::new(&API_METHOD_LOGIN))
+ .insert("metric-collection", metric_collection::cli())
.insert("pbs", pbs::cli())
.insert("pve", pve::cli())
.insert("remote", remotes::cli())
diff --git a/cli/client/src/metric_collection.rs b/cli/client/src/metric_collection.rs
new file mode 100644
index 00000000..e9dbd804
--- /dev/null
+++ b/cli/client/src/metric_collection.rs
@@ -0,0 +1,70 @@
+use anyhow::Error;
+use pdm_api_types::remotes::REMOTE_ID_SCHEMA;
+use proxmox_router::cli::{
+ format_and_print_result, CliCommand, CliCommandMap, CommandLineInterface, OutputFormat,
+};
+use proxmox_schema::api;
+
+use crate::{client, env};
+
+pub fn cli() -> CommandLineInterface {
+ CliCommandMap::new()
+ .insert(
+ "trigger",
+ CliCommand::new(&API_METHOD_TRIGGER_METRIC_COLLECTION),
+ )
+ .insert(
+ "status",
+ CliCommand::new(&API_METHOD_METRIC_COLLECTION_STATUS),
+ )
+ .into()
+}
+
+#[api(
+ input: {
+ properties: {
+ remote: {
+ schema: REMOTE_ID_SCHEMA,
+ optional: true,
+ },
+ }
+ }
+)]
+/// Trigger metric collection. If a remote is passed, only this remote will be collected, otherwise
+/// all.
+async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+ client()?
+ .trigger_metric_collection(remote.as_deref())
+ .await?;
+ Ok(())
+}
+
+#[api]
+/// Show metric collection status.
+async fn metric_collection_status() -> Result<(), Error> {
+ let result = client()?.get_metric_collection_status().await?;
+
+ let output_format = env().format_args.output_format;
+ if output_format == OutputFormat::Text {
+ for remote_status in result {
+ let timestamp = if let Some(last_collection) = remote_status.last_collection {
+ proxmox_time::strftime_local("%a, %d %b %Y %T %z", last_collection)?
+ } else {
+ "never".into()
+ };
+
+ let status = if let Some(err) = &remote_status.error {
+ err
+ } else {
+ "ok"
+ };
+
+ println!("{}: {status}", remote_status.remote);
+ println!(" last successful: {timestamp}");
+ println!();
+ }
+ } else {
+ format_and_print_result(&result, &output_format.to_string());
+ }
+ Ok(())
+}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 21/24] metric collection: skip missed timer ticks
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (19 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 20/24] cli: add commands for metric-collection trigger and status Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 22/24] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
` (3 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
The default behavior of `tokio::time::Interval` when a tick is missed
is 'burst', which is to shorten the timer intervals until the original
alignment is restored, while keeping the *number of ticks* the same as expected.
Example from the official tokio docs [1] for burst mode:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work|work|work-|work-----|
For metric collection, we do not really gain anything from bursting
missed ticks. For us, 'skip' is fine. There, the alignment is
immediately restored by skipping any ticks that have been missed:
Expected: | 1 | 2 | 3 | 4 | 5 | 6 |
Actual: | work ---| delay |work---| work----| work----|
[1] https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html
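As a self-contained illustration of the skip behavior (the interval length and
sleep duration are chosen arbitrarily here, not taken from the patch):

    use std::time::Duration;
    use tokio::time::{self, MissedTickBehavior};

    #[tokio::main]
    async fn main() {
        let mut timer = time::interval(Duration::from_secs(1));
        // Same setting as in the patch below: realign instead of bursting.
        timer.set_missed_tick_behavior(MissedTickBehavior::Skip);

        timer.tick().await; // the first tick completes immediately
        // Simulate a collection run that overruns two full intervals.
        time::sleep(Duration::from_millis(2500)).await;
        // With `Skip`, the next tick fires once and the timer realigns,
        // instead of firing twice in quick succession to catch up.
        timer.tick().await;
    }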
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
New in v2.
server/src/metric_collection/collection_task.rs | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index e06c919d..28687111 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -9,7 +9,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
- time::Interval,
+ time::{Interval, MissedTickBehavior},
};
use proxmox_section_config::typed::SectionConfigData;
@@ -165,6 +165,13 @@ impl MetricCollectionTask {
fn setup_timer(interval: u64) -> (Interval, Instant) {
log::debug!("setting metric collection interval timer to {interval} seconds.",);
let mut timer = tokio::time::interval(Duration::from_secs(interval));
+
+ // If we miss a tick because a previous collection run took too long, we want to
+ // tick as soon as possible, but we do not need to repeat missed ticks.
+ // We do want to stay aligned, though.
+ // https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html#variant.Skip
+ timer.set_missed_tick_behavior(MissedTickBehavior::Skip);
+
let first_run = task_utils::next_aligned_instant(interval);
timer.reset_at(first_run.into());
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 22/24] metric collection: use JoinSet instead of joining from handles in a Vec
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (20 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 21/24] metric collection: skip missed timer ticks Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 23/24] metric collection: allow to wait until completion when triggering collection manually Lukas Wagner
` (2 subsequent siblings)
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This lets us process finished tasks in the order they finish, not in the
order they were spawned.
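A minimal, standalone sketch of the pattern (the remote names are made up and
the per-remote work is a placeholder):

    use tokio::task::JoinSet;

    #[tokio::main]
    async fn main() {
        let mut set = JoinSet::new();
        for name in ["remote-a", "remote-b", "remote-c"] {
            set.spawn(async move {
                // placeholder for the per-remote metric fetch
                name
            });
        }
        // join_next() yields results in completion order, not spawn order.
        while let Some(res) = set.join_next().await {
            match res {
                Ok(name) => println!("finished: {name}"),
                Err(err) => eprintln!("join error: {err}"),
            }
        }
    }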
Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
Reviewed-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
---
Notes:
New in v2.
.../src/metric_collection/collection_task.rs | 25 ++++++++-----------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 28687111..1e23fa88 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -9,6 +9,7 @@ use tokio::{
mpsc::{Receiver, Sender},
oneshot, OwnedSemaphorePermit, Semaphore,
},
+ task::JoinSet,
time::{Interval, MissedTickBehavior},
};
@@ -200,7 +201,8 @@ impl MetricCollectionTask {
remotes_to_fetch: &[String],
) {
let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_CONNECTIONS));
- let mut handles = Vec::new();
+ let mut handles = JoinSet::new();
+
let now = proxmox_time::epoch_i64();
for remote_name in remotes_to_fetch {
@@ -223,29 +225,22 @@ impl MetricCollectionTask {
if let Some(remote) = remote_config.get(remote_name).cloned() {
log::debug!("fetching remote '{}'", remote.id);
- let handle = tokio::spawn(Self::fetch_single_remote(
+ handles.spawn(Self::fetch_single_remote(
remote,
status,
self.metric_data_tx.clone(),
permit,
));
-
- handles.push((remote_name.clone(), handle));
}
}
- for (remote_name, handle) in handles {
- let res = handle.await;
-
+ while let Some(res) = handles.join_next().await {
match res {
- Ok(Ok(ts)) => {
- self.state.set_status(remote_name, ts);
+ Ok((name, status)) => {
+ self.state.set_status(name, status);
}
- Ok(Err(err)) => log::error!("failed to collect metrics for {remote_name}: {err}"),
Err(err) => {
- log::error!(
- "join error for metric collection task for remote {remote_name}: {err}"
- )
+ log::error!("join error for metric collection task for remote: {err}")
}
}
}
@@ -292,7 +287,7 @@ impl MetricCollectionTask {
mut status: RemoteStatus,
sender: Sender<RrdStoreRequest>,
_permit: OwnedSemaphorePermit,
- ) -> Result<RemoteStatus, Error> {
+ ) -> (String, RemoteStatus) {
let (result_tx, result_rx) = oneshot::channel();
let now = proxmox_time::epoch_i64();
@@ -360,7 +355,7 @@ impl MetricCollectionTask {
}
}
- Ok(status)
+ (remote.id, status)
}
}
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 23/24] metric collection: allow to wait until completion when triggering collection manually
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (21 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 22/24] metric collection: use JoinSet instead of joining from handles in a Vec Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 24/24] api: pve: rrd: trigger and wait for metric collection when requesting RRD data Lukas Wagner
2025-08-28 19:37 ` [pdm-devel] applied: [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Thomas Lamprecht
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
This allows us to request the latest metrics for a single remote in the
rrddata API calls, closing the gap in data that results from the longer
10 min poll interval.
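The mechanism boils down to passing a oneshot sender along with the control
message and only awaiting it if the caller asked to wait. A rough standalone
sketch (names and types simplified, not the exact code from the patch):

    use tokio::sync::{mpsc, oneshot};

    enum ControlMsg {
        Trigger(Option<String>, oneshot::Sender<()>),
    }

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel::<ControlMsg>(16);

        // Stand-in for the collection task: signal completion via the oneshot.
        tokio::spawn(async move {
            while let Some(ControlMsg::Trigger(_remote, done_tx)) = rx.recv().await {
                // ... collect metrics here ...
                let _ = done_tx.send(()); // receiver may already be dropped
            }
        });

        // Caller side: only wait on the receiver if completion matters.
        let (done_tx, done_rx) = oneshot::channel();
        if tx.send(ControlMsg::Trigger(None, done_tx)).await.is_ok() {
            let _ = done_rx.await;
        }
    }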
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
New in v7.
server/src/api/metric_collection.rs | 2 +-
server/src/api/remotes.rs | 2 +-
.../src/metric_collection/collection_task.rs | 16 +++++++++++-----
server/src/metric_collection/mod.rs | 18 +++++++++++++++---
4 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/server/src/api/metric_collection.rs b/server/src/api/metric_collection.rs
index 845cc0e6..0658fb1f 100644
--- a/server/src/api/metric_collection.rs
+++ b/server/src/api/metric_collection.rs
@@ -34,7 +34,7 @@ const SUBDIRS: SubdirMap = &sorted!([
)]
/// Trigger metric collection for a provided remote or for all remotes if no remote is passed.
pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
- crate::metric_collection::trigger_metric_collection(remote).await?;
+ crate::metric_collection::trigger_metric_collection(remote, false).await?;
Ok(())
}
diff --git a/server/src/api/remotes.rs b/server/src/api/remotes.rs
index c2489e60..033aa7c9 100644
--- a/server/src/api/remotes.rs
+++ b/server/src/api/remotes.rs
@@ -186,7 +186,7 @@ pub async fn add_remote(mut entry: Remote, create_token: Option<String>) -> Resu
pdm_config::remotes::save_config(&remotes)?;
- if let Err(e) = metric_collection::trigger_metric_collection(Some(name)).await {
+ if let Err(e) = metric_collection::trigger_metric_collection(Some(name), false).await {
log::error!("could not trigger metric collection after adding remote: {e}");
}
diff --git a/server/src/metric_collection/collection_task.rs b/server/src/metric_collection/collection_task.rs
index 1e23fa88..5f67d65d 100644
--- a/server/src/metric_collection/collection_task.rs
+++ b/server/src/metric_collection/collection_task.rs
@@ -41,7 +41,7 @@ pub const MIN_COLLECTION_INTERVAL: u64 = 10;
/// Control messages for the metric collection task.
pub(super) enum ControlMsg {
- TriggerMetricCollection(Option<String>),
+ TriggerMetricCollection(Option<String>, oneshot::Sender<()>),
}
/// Task which periodically collects metrics from all remotes and stores
@@ -136,20 +136,26 @@ impl MetricCollectionTask {
/// Handle a control message for force-triggered collection.
async fn handle_control_message(&mut self, message: ControlMsg) {
if let Some(remotes) = Self::load_remote_config() {
- match message {
- ControlMsg::TriggerMetricCollection(Some(remote)) => {
+ let done_tx = match message {
+ ControlMsg::TriggerMetricCollection(Some(remote), done_tx) => {
log::debug!("starting metric collection for remote '{remote}'- triggered by control message");
self.fetch_remotes(&remotes, &[remote]).await;
+ done_tx
}
- ControlMsg::TriggerMetricCollection(None) => {
+ ControlMsg::TriggerMetricCollection(None, done_tx) => {
log::debug!("starting metric collection from all remotes - triggered by control message");
let to_fetch = remotes
.iter()
.map(|(name, _)| name.into())
.collect::<Vec<String>>();
self.fetch_remotes(&remotes, &to_fetch).await;
+ done_tx
}
- }
+ };
+
+ // We don't care about the send result: if the caller does not wait for completion,
+ // it might have dropped the receiver already.
+ let _ = done_tx.send(());
}
}
diff --git a/server/src/metric_collection/mod.rs b/server/src/metric_collection/mod.rs
index 2ddffda5..0e6860fc 100644
--- a/server/src/metric_collection/mod.rs
+++ b/server/src/metric_collection/mod.rs
@@ -5,6 +5,7 @@ use std::sync::OnceLock;
use anyhow::{bail, Error};
use nix::sys::stat::Mode;
use tokio::sync::mpsc::{self, Sender};
+use tokio::sync::oneshot;
use pdm_api_types::MetricCollectionStatus;
use pdm_buildcfg::PDM_STATE_DIR_M;
@@ -66,15 +67,26 @@ pub fn start_task() -> Result<(), Error> {
Ok(())
}
-/// Schedule metric collection for a given remote as soon as possible.
+/// Schedule metric collection as soon as possible.
+///
+/// If `remote` is `Some(String)`, then the remote with the given ID is
+/// collected. If remote is `None`, all remotes are scheduled for collection.
+/// If `wait` is `true`, this function waits for the completion of the requested
+/// metric collection run.
///
/// Has no effect if the tx end of the channel has not been initialized yet.
/// Returns an error if the mpsc channel has been closed already.
-pub async fn trigger_metric_collection(remote: Option<String>) -> Result<(), Error> {
+pub async fn trigger_metric_collection(remote: Option<String>, wait: bool) -> Result<(), Error> {
+ let (done_sender, done_receiver) = oneshot::channel();
+
if let Some(sender) = CONTROL_MESSAGE_TX.get() {
sender
- .send(ControlMsg::TriggerMetricCollection(remote))
+ .send(ControlMsg::TriggerMetricCollection(remote, done_sender))
.await?;
+
+ if wait {
+ done_receiver.await?;
+ }
}
Ok(())
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] [PATCH proxmox-datacenter-manager v7 24/24] api: pve: rrd: trigger and wait for metric collection when requesting RRD data
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (22 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 23/24] metric collection: allow to wait until completion when triggering collection manually Lukas Wagner
@ 2025-08-26 13:51 ` Lukas Wagner
2025-08-28 19:37 ` [pdm-devel] applied: [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Thomas Lamprecht
24 siblings, 0 replies; 26+ messages in thread
From: Lukas Wagner @ 2025-08-26 13:51 UTC (permalink / raw)
To: pdm-devel
Since we now default to a much longer collection interval (10 min), the hourly
RRD data might have a noticeable gap at the end. To circumvent this, we
now trigger metric collection for a single remote when requesting
hourly RRD data, waiting for the completion of metric collection up to a
short timeout of 5 seconds. If the timeout expires, which can happen if
the metric collection for this remote is particularly slow (bad
connection), or if the metric collection task is currently busy with a
full run that's taking a long time, we simply return the data that we
already have.
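A condensed sketch of the timeout pattern used below (the durations mirror the
patch, while the refresh function is only a stand-in for triggering collection
and waiting for it):

    use std::time::Duration;
    use tokio::time::timeout;

    async fn refresh_remote_metrics() {
        // stand-in for triggering collection and waiting for completion
        tokio::time::sleep(Duration::from_secs(10)).await;
    }

    #[tokio::main]
    async fn main() {
        // Ignore the outcome: if the refresh does not finish within 5 seconds,
        // we simply continue and serve the RRD data we already have.
        let _ = timeout(Duration::from_secs(5), refresh_remote_metrics()).await;
        println!("returning whatever RRD data is currently available");
    }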
Signed-off-by: Lukas Wagner <l.wagner@proxmox.com>
---
Notes:
New in v7.
server/src/api/pve/rrddata.rs | 43 +++++++++++++++++++++++++++--------
1 file changed, 34 insertions(+), 9 deletions(-)
diff --git a/server/src/api/pve/rrddata.rs b/server/src/api/pve/rrddata.rs
index b16c2313..b6a04037 100644
--- a/server/src/api/pve/rrddata.rs
+++ b/server/src/api/pve/rrddata.rs
@@ -1,3 +1,5 @@
+use std::time::Duration;
+
use anyhow::Error;
use serde_json::Value;
@@ -10,6 +12,7 @@ use pdm_api_types::rrddata::{LxcDataPoint, NodeDataPoint, QemuDataPoint};
use pdm_api_types::{NODE_SCHEMA, PRIV_RESOURCE_AUDIT, VMID_SCHEMA};
use crate::api::rrd_common::{self, DataPoint};
+use crate::metric_collection;
impl DataPoint for NodeDataPoint {
fn new(time: u64) -> Self {
@@ -161,7 +164,7 @@ impl DataPoint for LxcDataPoint {
},
)]
/// Read qemu stats
-fn get_qemu_rrd_data(
+async fn get_qemu_rrd_data(
remote: String,
vmid: u32,
timeframe: RrdTimeframe,
@@ -169,8 +172,7 @@ fn get_qemu_rrd_data(
_param: Value,
) -> Result<Vec<QemuDataPoint>, Error> {
let base = format!("pve/{remote}/qemu/{vmid}");
-
- rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+ get_rrd_datapoints(remote, base, timeframe, cf).await
}
#[api(
@@ -191,7 +193,7 @@ fn get_qemu_rrd_data(
},
)]
/// Read lxc stats
-fn get_lxc_rrd_data(
+async fn get_lxc_rrd_data(
remote: String,
vmid: u32,
timeframe: RrdTimeframe,
@@ -199,8 +201,7 @@ fn get_lxc_rrd_data(
_param: Value,
) -> Result<Vec<LxcDataPoint>, Error> {
let base = format!("pve/{remote}/lxc/{vmid}");
-
- rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+ get_rrd_datapoints(remote, base, timeframe, cf).await
}
#[api(
@@ -221,7 +222,7 @@ fn get_lxc_rrd_data(
},
)]
/// Read node stats
-fn get_node_rrd_data(
+async fn get_node_rrd_data(
remote: String,
node: String,
timeframe: RrdTimeframe,
@@ -229,9 +230,33 @@ fn get_node_rrd_data(
_param: Value,
) -> Result<Vec<NodeDataPoint>, Error> {
let base = format!("pve/{remote}/node/{node}");
-
- rrd_common::create_datapoints_from_rrd(&base, timeframe, cf)
+ get_rrd_datapoints(remote, base, timeframe, cf).await
}
+
+async fn get_rrd_datapoints<T: DataPoint + Send + 'static>(
+ remote: String,
+ basepath: String,
+ timeframe: RrdTimeframe,
+ mode: RrdMode,
+) -> Result<Vec<T>, Error> {
+ const WAIT_FOR_NEWEST_METRIC_TIMEOUT: Duration = Duration::from_secs(5);
+
+ if timeframe == RrdTimeframe::Hour {
+ // Let's wait for a limited time for the most recent metrics. If the connection to the remote
+ // is super slow or if the metric collection task is currently busy with collecting other
+ // metrics, we just return the data we already have, not the newest one.
+ let _ = tokio::time::timeout(WAIT_FOR_NEWEST_METRIC_TIMEOUT, async {
+ metric_collection::trigger_metric_collection(Some(remote), true).await
+ })
+ .await;
+ }
+
+ tokio::task::spawn_blocking(move || {
+ rrd_common::create_datapoints_from_rrd(&basepath, timeframe, mode)
+ })
+ .await?
+}
+
pub const QEMU_RRD_ROUTER: Router = Router::new().get(&API_METHOD_GET_QEMU_RRD_DATA);
pub const LXC_RRD_ROUTER: Router = Router::new().get(&API_METHOD_GET_LXC_RRD_DATA);
pub const NODE_RRD_ROUTER: Router = Router::new().get(&API_METHOD_GET_NODE_RRD_DATA);
--
2.47.2
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread
* [pdm-devel] applied: [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI)
2025-08-26 13:50 [pdm-devel] [PATCH proxmox-datacenter-manager v7 00/24] metric collection improvements (concurrency, API, CLI) Lukas Wagner
` (23 preceding siblings ...)
2025-08-26 13:51 ` [pdm-devel] [PATCH proxmox-datacenter-manager v7 24/24] api: pve: rrd: trigger and wait for metric collection when requesting RRD data Lukas Wagner
@ 2025-08-28 19:37 ` Thomas Lamprecht
24 siblings, 0 replies; 26+ messages in thread
From: Thomas Lamprecht @ 2025-08-28 19:37 UTC (permalink / raw)
To: pdm-devel, Lukas Wagner
On Tue, 26 Aug 2025 15:50:55 +0200, Lukas Wagner wrote:
> Key points:
> - fetch metrics concurrently
> - Add some tests for the core logic in the metric collection system
> - Allow to trigger metric collection via the API
> - Record metric collection statistics in the RRD
> - overall collection time for all remotes
> - per remote response time when fetching metrics
> - Persist metric collection state to disk:
> /var/lib/proxmox-datacenter-manager/metric-collection-state.json
> (timestamps of last collection, errors)
> - Trigger metric collection for any new remotes added via the API
>
> [...]
Applied, thanks!
[01/24] metric collection: split top_entities split into separate module
commit: 538fc07e95c39ee643c1c8498361e3592e3f42d0
[02/24] metric collection: save metric data to RRD in separate task
commit: 633cada6b3e1d8e86a0e91c8e89bac6a3e9ea5db
[03/24] metric collection: rework metric poll task
commit: 397574e3073e37351d01eed6ac363fcf8e571d15
[04/24] metric collection: persist state after metric collection
commit: 4b9468fee556aabcb446f38fa86e1b737de565b1
[05/24] metric collection: skip if last_collection < MIN_COLLECTION_INTERVAL
commit: 8ba70d487a8f0b5489fb489cf92431e1c42f056a
[06/24] metric collection: collect overdue metrics on startup/timer change
commit: 28268ed6e318ef21001e68ceb1f381e83ea19d6a
[07/24] metric collection: add tests for the fetch_remotes function
commit: 64dfe3fc69632dc7a546ccfba3ff35296dd9e50e
[08/24] metric collection: add test for fetch_overdue
commit: b5aa756b9730290c5e37d33dd97a9468f14ec408
[09/24] metric collection: pass rrd cache instance as function parameter
commit: 14c93fa2909946824a9b08d26e1eab7b7aa53ef6
[10/24] metric collection: add test for rrd task
commit: 80aa0071966eb8f7b7f770d9d942a380419ca452
[11/24] metric collection: wrap rrd_cache::Cache in a struct
commit: 497c5d7efc3d1e165d9ee8aef5a32f066db19a74
[12/24] metric collection: record remote response time in metric database
commit: 235a79507a06588e0173c48090a342a1f93575d6
[13/24] metric collection: save time needed for collection run to RRD
commit: e160d4ed2bc2d6ec5dba48697d6729ee9192e11b
[14/24] metric collection: periodically clean removed remotes from statefile
commit: ed18c8388a21a3df98c939546dba5c4bcfa15961
[15/24] api: add endpoint to trigger metric collection
commit: e3a6bccdea013ad5f1f2756d818f38d541d810eb
[16/24] api: remotes: trigger immediate metric collection for newly added nodes
commit: 51f93b45bc6aa1873d00df709b4a7645ec134345
[17/24] api: add api for querying metric collection RRD data
commit: 9b01a3822a80490e6c830f613d5c0fbe5a4a958b
[18/24] api: metric-collection: add status endpoint
commit: 07e2cdf2bc10521d4d2608596a6e44a9c728024b
[19/24] pdm-client: add metric collection API methods
commit: ae60329b68c57b7820be680861b914dd38458b45
[20/24] cli: add commands for metric-collection trigger and status
commit: 7b37bbe1a2ba3cae243779fc833357f42bc92229
[21/24] metric collection: skip missed timer ticks
commit: 7030ad0a8ff011d8eb40af4e95e20b5c5210d60a
[22/24] metric collection: use JoinSet instead of joining from handles in a Vec
commit: 075002688de2f7062ebf1333e7f5b266ed55ff44
[23/24] metric collection: allow to wait until completion when triggering collection manually
commit: 1f9cf78be132a2004b15ce944df2dff648b4bf7d
[24/24] api: pve: rrd: trigger and wait for metric collection when requesting RRD data
commit: a6df09b101d5f091f4f5a9058f99ae364629916a
_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel
^ permalink raw reply [flat|nested] 26+ messages in thread