From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH proxmox v2 04/40] resource-scheduling: introduce generic scheduler implementation
Date: Thu, 26 Mar 2026 11:19:08 +0100 [thread overview]
Message-ID: <DHCMYPYGMY0I.179IM7YU0XVMC@proxmox.com> (raw)
In-Reply-To: <20260324183029.1274972-5-d.kral@proxmox.com>
LGTM.

Please find my comments inline, mostly nits.
On Tue Mar 24, 2026 at 7:29 PM CET, Daniel Kral wrote:
> The existing score_nodes_to_start_resource(...) function is dependent on
> the StaticNodeUsage and StaticServiceUsage structs.
>
> To use this function for other usage stats structs as well, declare
> generic NodeStats and ResourceStats structs, that the users can convert
> into. These are used to make score_nodes_to_start_resource(...) and its
> documentation generic.
>
> The pve_static::score_nodes_to_start_service(...) is marked as
> deprecated accordingly. The usage-related structs are marked as
> deprecated as well as the specific usage implementations - including
> their serialization and deserialization - should be handled by the
> caller now.
>
> This is best viewed with the git option --ignore-all-space.
>
> No functional changes intended.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> This patch was "[RFC proxmox 2/5] resource-scheduling: introduce generic
> cluster usage implementation" in the RFC v1. Sorry for the ambigious
> naming with the next patch!
>
> changes v1 -> v2:
> - add more information to the patch message
> - split out `NodeStats` and `ResourceStats` to their own modules, which
> will also be used in an upcoming patch for the `Usage` implementation
> - add deprecation note to pve_static::score_nodes_to_start_service()
> - add deprecation attribute to other major pve_static items as well
> - impl `Add` and `Sum` for ResourceStats, which will be used for the
> resource bundling in pve-rs later
> - eagerly implement common traits (especially Clone and Debug)
> - add test cases for the
> scheduler::Scheduler::score_nodes_to_start_resource()
[snip]
Good to have:

    #[derive(Clone, Debug)]
> +pub struct Scheduler {
> + nodes: Vec<NodeUsage>,
> +}
> +
nit: The implementation of `Scheduler` is totally fine as-is. This is
just my two cents, as this was mentioned off-list.

I believe that implementing the scheduler with enum variants plus a
trait, and then exploiting static dispatch, is more convenient and
easier to maintain, e.g.:
pub enum Schedulerr<Nodes> {
    Topsis(Nodes),
    BruteForce(Nodes),
}

pub trait Decide {
    fn node_imbalance(&self) -> f64;
    fn node_imbalance_with_migration_candidate(&self, candidate: &MigrationCandidate) -> f64;
    fn score_best_balancing_migration_candidates(
        &self,
        candidates: &[MigrationCandidate],
        limit: usize,
    ) -> Result<Vec<ScoredMigration>, Error>;
}

impl Decide for Schedulerr<Vec<NodeUsage>> {
    fn node_imbalance(&self) -> f64 {
        match self {
            Self::Topsis(nodes) | Self::BruteForce(nodes) => {
                calculate_node_imbalance(nodes, |node| node.stats.load())
            }
        }
    }

    fn node_imbalance_with_migration_candidate(&self, candidate: &MigrationCandidate) -> f64 {
        match self {
            Self::Topsis(nodes) | Self::BruteForce(nodes) => {
                calculate_node_imbalance(nodes, |node| {
                    let mut new_stats = node.stats;
                    if node.name == candidate.migration.source_node {
                        new_stats.remove_running_resource(&candidate.stats);
                    } else if node.name == candidate.migration.target_node {
                        new_stats.add_running_resource(&candidate.stats);
                    }
                    new_stats.load()
                })
            }
        }
    }

    fn score_best_balancing_migration_candidates(
        &self,
        candidates: &[MigrationCandidate],
        limit: usize,
    ) -> Result<Vec<ScoredMigration>, Error> {
        match self {
            Self::Topsis(nodes) => {
                let len = nodes.len();
                let matrix = candidates
                    .iter()
                    .map(|candidate| {
                        let resource_stats = &candidate.stats;
                        let source_node = &candidate.migration.source_node;
                        let target_node = &candidate.migration.target_node;

                        let mut highest_cpu = 0.0;
                        let mut squares_cpu = 0.0;
                        let mut highest_mem = 0.0;
                        let mut squares_mem = 0.0;

                        for node in nodes.iter() {
                            let new_stats = {
                                let mut new_stats = node.stats;
                                if &node.name == source_node {
                                    new_stats.remove_running_resource(resource_stats);
                                } else if &node.name == target_node {
                                    new_stats.add_running_resource(resource_stats);
                                }
                                new_stats
                            };

                            let new_cpu = new_stats.cpu_load();
                            highest_cpu = f64::max(highest_cpu, new_cpu);
                            squares_cpu += new_cpu.powi(2);

                            let new_mem = new_stats.mem_load();
                            highest_mem = f64::max(highest_mem, new_mem);
                            squares_mem += new_mem.powi(2);
                        }

                        // Add 1.0 to avoid boosting tiny differences: e.g. 0.004 is
                        // twice as much as 0.002, but 1.004 is only slightly more
                        // than 1.002.
                        PveTopsisAlternative {
                            average_cpu: 1.0 + (squares_cpu / len as f64).sqrt(),
                            highest_cpu: 1.0 + highest_cpu,
                            average_memory: 1.0 + (squares_mem / len as f64).sqrt(),
                            highest_memory: 1.0 + highest_mem,
                        }
                        .into()
                    })
                    .collect::<Vec<_>>();

                let best_alternatives = topsis::rank_alternatives(
                    &topsis::Matrix::new(matrix)?,
                    &PVE_HA_TOPSIS_CRITERIA,
                )?;

                Ok(best_alternatives
                    .into_iter()
                    .take(limit)
                    .map(|i| {
                        let imbalance =
                            self.node_imbalance_with_migration_candidate(&candidates[i]);
                        ScoredMigration::new(candidates[i].clone(), imbalance)
                    })
                    .collect())
            }
            Self::BruteForce(_) => {
                let mut scored_migrations = candidates
                    .iter()
                    .map(|candidate| {
                        let imbalance = self.node_imbalance_with_migration_candidate(candidate);
                        // NOTE: could avoid clone if Migration had additional score field
                        Reverse(ScoredMigration::new(candidate.clone(), imbalance))
                    })
                    .collect::<BinaryHeap<_>>();

                let mut best_migrations = Vec::with_capacity(limit);
                // BinaryHeap::into_iter_sorted() is still in nightly unfortunately
                while best_migrations.len() < limit {
                    match scored_migrations.pop() {
                        Some(Reverse(alternative)) => best_migrations.push(alternative),
                        None => break,
                    }
                }

                Ok(best_migrations)
            }
        }
    }
}

pub fn score_best_balancing_migration_candidates(
    scheduler: &impl Decide,
    candidates: &[MigrationCandidate],
    limit: usize,
) -> Result<Vec<ScoredMigration>, Error> {
    scheduler.score_best_balancing_migration_candidates(candidates, limit)
}
In a nutshell, this declares what a scheduler must be able to do to be
used for scoring (that is, implement the `Decide` trait) and implements
the functionality for all variants in one place.

Nice side-effects of this design:

* one scoring function covers all the variants in one place, which is
  nice, I think
* adding or removing a scheduler variant becomes more systematic
* modifying how a scheduler variant scores or, for example, measures
  imbalance also becomes more straightforward
Again, just my two cents.
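To make the pattern concrete, here is a stripped-down, self-contained
sketch of it; `ToyScheduler`, the plain `Vec<f64>` of node loads and the
max-min imbalance metric are toy stand-ins for illustration only, not
the actual `NodeUsage`/`MigrationCandidate` types from the patch:

```rust
// Enum variants share data; a trait declares what a scheduler must do.
pub enum ToyScheduler {
    Topsis(Vec<f64>),
    BruteForce(Vec<f64>),
}

pub trait Decide {
    /// Spread between the most and least loaded node.
    fn node_imbalance(&self) -> f64;
}

impl Decide for ToyScheduler {
    fn node_imbalance(&self) -> f64 {
        match self {
            // Both variants use the same imbalance metric, implemented once.
            Self::Topsis(loads) | Self::BruteForce(loads) => {
                if loads.is_empty() {
                    return 0.0;
                }
                let max = loads.iter().cloned().fold(f64::MIN, f64::max);
                let min = loads.iter().cloned().fold(f64::MAX, f64::min);
                max - min
            }
        }
    }
}

// Statically dispatched entry point: monomorphized per concrete type,
// so there is no vtable indirection.
pub fn imbalance_of(scheduler: &impl Decide) -> f64 {
    scheduler.node_imbalance()
}
```

Callers only depend on `Decide`, so swapping variants, or an entirely
different scheduler type, needs no change at the call site.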
> +impl Scheduler {
> + /// Instantiate scheduler instance from node usages.
> + pub fn from_nodes<I>(nodes: I) -> Self
> + where
> + I: IntoIterator<Item: Into<NodeUsage>>,
> + {
> + Self {
> + nodes: nodes.into_iter().map(|node| node.into()).collect(),
> + }
> + }
> +
> + /// Scores nodes to start a resource with the usage statistics `resource_stats` on.
> + ///
> + /// The scoring is done as if the resource is already started on each node. This assumes that
> + /// the already started resource consumes the maximum amount of each stat according to its
> + /// `resource_stats`.
> + ///
> + /// Returns a vector of (nodename, score) pairs. Scores are between 0.0 and 1.0 and a higher
> + /// score is better.
> + pub fn score_nodes_to_start_resource<T: Into<ResourceStats>>(
> + &self,
> + resource_stats: T,
> + ) -> Result<Vec<(String, f64)>, Error> {
> + let len = self.nodes.len();
> + let resource_stats = resource_stats.into();
> +
> + let matrix = self
> + .nodes
> + .iter()
> + .enumerate()
> + .map(|(target_index, _)| {
> + // Base values on percentages to allow comparing nodes with different stats.
> + let mut highest_cpu = 0.0;
> + let mut squares_cpu = 0.0;
> + let mut highest_mem = 0.0;
> + let mut squares_mem = 0.0;
> +
> + for (index, node) in self.nodes.iter().enumerate() {
> + let mut new_stats = node.stats;
> +
> + if index == target_index {
> + new_stats.add_started_resource(&resource_stats)
> + };
> +
> + let new_cpu = new_stats.cpu_load();
> + highest_cpu = f64::max(highest_cpu, new_cpu);
> + squares_cpu += new_cpu.powi(2);
> +
> + let new_mem = new_stats.mem_load();
> + highest_mem = f64::max(highest_mem, new_mem);
> + squares_mem += new_mem.powi(2);
> + }
> +
> + // Add 1.0 to avoid boosting tiny differences: e.g. 0.004 is twice as much as 0.002, but
> + // 1.004 is only slightly more than 1.002.
> + PveTopsisAlternative {
> + average_cpu: 1.0 + (squares_cpu / len as f64).sqrt(),
> + highest_cpu: 1.0 + highest_cpu,
> + average_memory: 1.0 + (squares_mem / len as f64).sqrt(),
> + highest_memory: 1.0 + highest_mem,
> + }
> + .into()
> + })
> + .collect::<Vec<_>>();
> +
> + let scores =
> + topsis::score_alternatives(&topsis::Matrix::new(matrix)?, &PVE_HA_TOPSIS_CRITERIA)?;
> +
> + Ok(scores
> + .into_iter()
> + .enumerate()
> + .map(|(n, score)| (self.nodes[n].name.to_string(), score))
> + .collect())
> + }
> }
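As a side note, the `1.0 +` offset in the comment above is easy to
sanity-check numerically; this is plain arithmetic with hypothetical
helper names, no crate types involved:

```rust
// Ratios of raw loads exaggerate tiny absolute differences ...
fn ratio(a: f64, b: f64) -> f64 {
    a / b
}

// ... while shifting both values by 1.0 keeps the ratio close to 1.0,
// so a 0.002 absolute difference no longer dominates the ranking.
fn shifted_ratio(a: f64, b: f64) -> f64 {
    (1.0 + a) / (1.0 + b)
}
```

With `a = 0.004` and `b = 0.002`, `ratio` yields 2.0 while
`shifted_ratio` stays just above 1.0, matching the comment.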
[snip]
In the future, the proptest crate could come in handy for such helper
functions; in general, I would propose adding proptests for invariants
like these later on.

For some examples and inspiration, see [0].

[0] https://lore.proxmox.com/all/20260306082046.34311-1-d.rusovac@proxmox.com/T/
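To illustrate the property style with std only (proptest would generate
and shrink the inputs instead, and the invariant would target the
crate's actual scoring helpers), here is a toy min-max normalization
with a hand-rolled input generator; all names here are hypothetical:

```rust
// Toy stand-in: min-max normalize a slice of loads into [0.0, 1.0].
fn normalize(loads: &[f64]) -> Vec<f64> {
    let max = loads.iter().cloned().fold(f64::MIN, f64::max);
    let min = loads.iter().cloned().fold(f64::MAX, f64::min);
    let span = max - min;
    loads
        .iter()
        .map(|l| if span == 0.0 { 0.0 } else { (l - min) / span })
        .collect()
}

// Tiny LCG producing values in [0.0, 1.0); proptest strategies would
// replace this ad-hoc generator.
fn lcg(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

// Property: for arbitrary node loads, all normalized scores stay
// within [0.0, 1.0], mirroring the documented score range.
fn property_holds(cases: usize) -> bool {
    let mut state = 42;
    (0..cases).all(|_| {
        let loads: Vec<f64> = (0..8).map(|_| lcg(&mut state) * 100.0).collect();
        normalize(&loads).iter().all(|s| (0.0..=1.0).contains(s))
    })
}
```

The win of proptest over such a loop is automatic shrinking: a failing
input gets reduced to a minimal counterexample.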
> +fn new_homogeneous_cluster_scheduler() -> Scheduler {
> + let (maxcpu, maxmem) = (16, 64 * (1 << 30));
> +
> + let node1 = NodeUsage {
> + name: String::from("node1"),
> + stats: NodeStats {
> + cpu: 1.7,
> + maxcpu,
> + mem: 12334 << 20,
> + maxmem,
> + },
> + };
> +
> + let node2 = NodeUsage {
> + name: String::from("node2"),
> + stats: NodeStats {
> + cpu: 15.184,
> + maxcpu,
> + mem: 529 << 20,
> + maxmem,
> + },
> + };
> +
> + let node3 = NodeUsage {
> + name: String::from("node3"),
> + stats: NodeStats {
> + cpu: 5.2,
> + maxcpu,
> + mem: 9381 << 20,
> + maxmem,
> + },
> + };
> +
> + Scheduler::from_nodes(vec![node1, node2, node3])
> +}
> +
> +fn new_heterogeneous_cluster_scheduler() -> Scheduler {
> + let node1 = NodeUsage {
> + name: String::from("node1"),
> + stats: NodeStats {
> + cpu: 1.7,
> + maxcpu: 16,
> + mem: 12334 << 20,
> + maxmem: 128 << 30,
> + },
> + };
> +
> + let node2 = NodeUsage {
> + name: String::from("node2"),
> + stats: NodeStats {
> + cpu: 15.184,
> + maxcpu: 32,
> + mem: 529 << 20,
> + maxmem: 96 << 30,
> + },
> + };
> +
> + let node3 = NodeUsage {
> + name: String::from("node3"),
> + stats: NodeStats {
> + cpu: 5.2,
> + maxcpu: 24,
> + mem: 9381 << 20,
> + maxmem: 64 << 30,
> + },
> + };
> +
> + Scheduler::from_nodes(vec![node1, node2, node3])
> +}
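Minor readability thought on the fixtures: the bit shifts encode byte
units (`x << 20` is x MiB, `x << 30` is x GiB), which could be made
explicit with small helpers, e.g. (hypothetical names):

```rust
// Byte-unit shorthands for the shifts used in the fixtures above.
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

fn mib(x: u64) -> u64 {
    x * MIB
}

fn gib(x: u64) -> u64 {
    x * GIB
}
```

Then `mem: 12334 << 20` could read `mem: mib(12334)`, which is just a
readability preference, not a correctness issue.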
[snip]
> +#[test]
> +fn test_score_homogeneous_nodes_to_start_resource() -> Result<(), Error> {
> + let scheduler = new_homogeneous_cluster_scheduler();
> +
> + let heavy_memory_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 1.0,
> + mem: 0,
> + maxmem: 12 << 30,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, heavy_memory_resource_stats)?,
> + vec!["node2", "node3", "node1"]
> + );
> +
> + let heavy_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 12.0,
> + mem: 0,
> + maxmem: 0,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, heavy_cpu_resource_stats)?,
> + vec!["node1", "node3", "node2"]
> + );
> +
> + let unlimited_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 0.0,
> + mem: 0,
> + maxmem: 0,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
> + vec!["node1", "node3", "node2"]
> + );
> +
nit: confusing variable name — the struct below sets `maxcpu: 12.0` and
`maxmem: 12 << 30`, so `unlimited_cpu_resource_stats` no longer fits;
maybe something like `heavy_cpu_and_memory_resource_stats`?
> + let unlimited_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 12.0,
> + mem: 0,
> + maxmem: 12 << 30,
> + };
> +
> + assert_eq!(
nit: confusing variable name
> + rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
> + vec!["node2", "node3", "node1"]
> + );
> +
> + Ok(())
> +}
> +
> +#[test]
> +fn test_score_heterogeneous_nodes_to_start_resource() -> Result<(), Error> {
> + let scheduler = new_heterogeneous_cluster_scheduler();
> +
> + let heavy_memory_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 1.0,
> + mem: 0,
> + maxmem: 12 << 30,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, heavy_memory_resource_stats)?,
> + vec!["node2", "node1", "node3"]
> + );
> +
> + let heavy_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 12.0,
> + mem: 0,
> + maxmem: 0,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, heavy_cpu_resource_stats)?,
> + vec!["node3", "node2", "node1"]
> + );
> +
> + let unlimited_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 0.0,
> + mem: 0,
> + maxmem: 0,
> + };
> +
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
> + vec!["node1", "node3", "node2"]
> + );
> +
nit: confusing variable name — same as in the homogeneous test, the
struct below sets `maxcpu: 12.0` and `maxmem: 12 << 30`, so
`unlimited_cpu_resource_stats` does not fit here either.
> + let unlimited_cpu_resource_stats = ResourceStats {
> + cpu: 0.0,
> + maxcpu: 12.0,
> + mem: 0,
> + maxmem: 12 << 30,
> + };
> +
nit: confusing variable name
> + assert_eq!(
> + rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
> + vec!["node2", "node1", "node3"]
> + );
> +
> + Ok(())
> +}