all lists on lists.proxmox.com
 help / color / mirror / Atom feed
From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [RFC proxmox 4/5] resource-scheduling: implement rebalancing migration selection
Date: Wed, 11 Mar 2026 09:21:17 +0100	[thread overview]
Message-ID: <DGZT2BOJIASQ.GRM78RR23RT7@proxmox.com> (raw)
In-Reply-To: <DGZ1E90K0WCS.2Y9ZYYI3TD7R3@proxmox.com>

On Tue Mar 10, 2026 at 11:40 AM CET, Daniel Kral wrote:
> Thanks a lot for taking the time to review this! I've added some inline
> notes, though most of them are just ACKs.
>
> On Mon Mar 9, 2026 at 2:32 PM CET, Dominik Rusovac wrote:
>> Nice work! Here are my thoughts and some things I noticed.
>>
>> On Tue Feb 17, 2026 at 3:13 PM CET, Daniel Kral wrote:
>>> Assuming that a service will hold the same dynamic resource usage on a
>>> new node as on the previous node, score possible migrations, where:
>>>
>>> - the cluster node imbalance is minimal (bruteforce), or
>>>
>>> - the shifted root mean square and maximum resource usages of the cpu
>>>   and memory is minimal across the cluster nodes (TOPSIS).
>>>
>>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>>> ---
>>> score_best_balancing_migrations() and select_best_balancing_migration()
>>> are separate because there could be future improvements for the single
>>> select, but might be unnecessary and redundant (especially since we need
>>> to expose it at perlmod and PVE::HA::Usage::{Dynamic,Static} twice).
>>
>> Regarding future improvements for the "single select", providing an
>> `imbalance_threshold` to use for early termination may pay off
>> performance-wise, e.g., something along the lines of:
>
> The main idea to limit it in size is because of the
> serialization/deserialization to Perl values, but writing it out makes
> it clear that this should probably be the responsibility of the caller
> (i.e., the perlmod bindings).
>
> I might consider dropping the select_best_balancing_migration(...) here
> entirely as we can still add it later on and use
> score_best_balancing_migrations(..., limit=1) anywhere else now...

Sure. I just wanted to emphasize that `limit=1` is a special case where
we might want to terminate earlier, that is, whenever a migration
candidate is found that meets our expectations, e.g., "the imbalance
ought to be at most 0.7" (threshold is 0.7) or "the current imbalance
ought to improve by 0.3" (threshold is current imbalance - 0.3).
Providing such a threshold would of course also be the responsibility of
the caller.

>
>>
>>     pub fn select_best_balancing_migration<I>(
>>         &self,
>>         candidates: I,
>>         imbalance_threshold: Option<f64>
>>     ) -> Option<ScoredMigration>
>>     where
>>         I: IntoIterator<Item = MigrationCandidate>,
>>     {
>>         let threshold = imbalance_threshold.unwrap_or(0.0);
>>
>>         let mut scored_migrations = BinaryHeap::new();
>>         for candidate in candidates {
>>                 let imbalance = self.node_imbalance_with_migration(&candidate);
>>
>>                 let migration = ScoredMigration {
>>                     migration: candidate.into(),
>>                     imbalance,
>>                 };
>>
>>                 if imbalance <= threshold {
>>                     return Some(migration);
>>                 }
>
> Hm, even though this could be nice, I'm not so sure about it as setting
> a threshold externally might not be that easy and this will not consider

These considerations might be overkill, but atm I cannot foresee how
much of a bottleneck bruteforcing our way through all candidates is, in
real-world scenarios. The idea behind setting a threshold is to mitigate
the risk of inefficient bruteforce search by trading off the quality of
solutions for efficiency.

In general, such a threshold may simply serve as a configurable
parameter that declares the amount of imbalance a user is willing to
accept for the sake of quicker decisions. The threshold value could, for
example, be 0.0, unless otherwise specified.

> any later possibly better migration candidates?
>

Indeed, terminating earlier on a threshold > 0.0 may overlook better
migration candidates, but this is part of the trade-off. Relating to
this, in my understanding, no candidate can score an imbalance < 0.0.
So, at least we can always terminate whenever a candidate that scores an
imbalance of 0.0 is found.

>>
>>                 scored_migrations.push(migration);
>>             }
>>
>>         scored_migrations.pop()
>>     }
>>>
>>
>> [snip] 
>>
>>> +fn calculate_node_loads(nodes: &[NodeStats]) -> Vec<f64> {
>>> +    nodes.iter().map(|stats| stats.load()).collect()
>>> +}
>>> +
>>> +/// Returns the load imbalance among the nodes.
>>> +///
>>> +/// The load balance is measured as the statistical dispersion of the individual node loads.
>>> +///
>>> +/// The current implementation uses the dimensionless coefficient of variation, which expresses the
>>> +/// standard deviation in relation to the average mean of the node loads. Additionally, the
>>> +/// coefficient of variation is not robust, which is
>>
>> Parts of the docs are missing, it seems.
>
> Right, will fix that in a v2!
>
>>
>>> +fn calculate_node_imbalance(nodes: &[NodeStats]) -> f64 {
>>> +    let node_count = nodes.len();
>>> +    let node_loads = calculate_node_loads(nodes);
>>> +
>>> +    let load_sum = node_loads
>>> +        .iter()
>>> +        .fold(0.0, |sum, node_load| sum + node_load);
>>
>> To increase readability, consider:
>>     
>>     let load_sum = node_loads.iter().sum::<f64>();
>
> Will do!
>
> I still want to document what Dominik pointed out to me off-list: Even
> though that Iterator<Item = f64>::sum() will return -0.0 for iterators
> that yield nothing, but the following `load_sum == 0.0` test will still
> hold true as -0.0 == 0.0 in IEEE754 [0].
>
> [0] https://en.wikipedia.org/wiki/IEEE_754#Comparison_predicates
>

Forgot to elaborate on this here, thx for mentioning it.

>>
>>> +
>>> +    // load_sum is guaranteed to be 0.0 for empty nodes
>>> +    if load_sum == 0.0 {
>>> +        0.0
>>> +    } else {
>>> +        let load_mean = load_sum / node_count as f64;
>>> +
>>> +        let squared_diff_sum = node_loads
>>> +            .iter()
>>> +            .fold(0.0, |sum, node_load| sum + (node_load - load_mean).powi(2));
>>> +        let load_sd = (squared_diff_sum / node_count as f64).sqrt();
>>> +
>>> +        load_sd / load_mean
>>> +    }
>>>  }
>>
>> [0] Considering where this function is used, expecting a
>> designated closure that is used to map `NodeUsage` onto `f64` seems 
>> to be more convenient, e.g.:
>>
>>     fn calculate_node_imbalance(
>>         nodes: &[NodeUsage],
>> 	to_load: impl FnMut(&NodeUsage) -> f64,
>>     ) -> f64 {
>>         let node_count = nodes.len();
>> 	let node_loads = nodes.iter().map(to_load).collect::<Vec<_>>();
>>
>> 	// ...
>>     }
>>
>
> Great idea! I'll change it to this in the v2!
>
>>>  
>>
>> [snip]
>>
>>> +    fn node_stats(&self) -> Vec<NodeStats> {
>>> +        self.nodes.iter().map(|node| node.stats).collect()
>>> +    }
>>> +
>>> +    /// Returns the individual node loads.
>>> +    pub fn node_loads(&self) -> Vec<(String, f64)> {
>>> +        self.nodes
>>> +            .iter()
>>> +            .map(|node| (node.name.to_string(), node.stats.load()))
>>> +            .collect()
>>> +    }
>>> +
>>> +    /// Returns the load imbalance among the nodes.
>>> +    ///
>>> +    /// See [`calculate_node_imbalance`] for more information.
>>> +    pub fn node_imbalance(&self) -> f64 {
>>> +        let node_stats = self.node_stats();
>>> +
>>> +        calculate_node_imbalance(&node_stats)
>>> +    }
>>
>> Regarding [0], passing a closure compresses two iterations into one, e.g.:
>>
>>     pub fn node_imbalance(&self) -> f64 {
>>         calculate_node_imbalance(&self.nodes, |node| node.stats.load())
>>     }
>>
>
> ACK
>
>>> +
>>> +    /// Returns the load imbalance among the nodes as if a specific service was moved.
>>> +    ///
>>> +    /// See [`calculate_node_imbalance`] for more information.
>>> +    pub fn node_imbalance_with_migration(&self, migration: &MigrationCandidate) -> f64 {
>>> +        let mut new_node_stats = Vec::with_capacity(self.nodes.len());
>>> +
>>> +        self.nodes.iter().for_each(|node| {
>>> +            let mut new_stats = node.stats;
>>> +
>>> +            if node.name == migration.source_node {
>>> +                new_stats.remove_running_service(&migration.stats);
>>> +            } else if node.name == migration.target_node {
>>> +                new_stats.add_running_service(&migration.stats);
>>> +            }
>>> +
>>> +            new_node_stats.push(new_stats);
>>> +        });
>>> +
>>> +        calculate_node_imbalance(&new_node_stats)
>>> +    }
>>
>> Same here regarding [0], e.g.:
>>
>>     pub fn node_imbalance_with_migration(&self, migration: &MigrationCandidate) -> f64 {
>>         calculate_node_imbalance_some(&self.nodes, |node| {
>>             let mut new_stats = node.stats;
>>
>>             if node.name == migration.source_node {
>>                 new_stats.remove_running_service(&migration.stats);
>>             } else if node.name == migration.target_node {
>>                 new_stats.add_running_service(&migration.stats);
>>             }
>>
>>             new_stats.load()
>>         })
>>     }
>>
>
> ACK, nice!
>
>>> +
>>> +    /// Score the service motions by the best node imbalance improvement with exhaustive search.
>>> +    pub fn score_best_balancing_migrations<I>(
>>> +        &self,
>>> +        candidates: I,
>>> +        limit: usize,
>>> +    ) -> Result<Vec<ScoredMigration>, Error>
>>
>> Does this really have to return a `Result`? It uses no try operator and
>> provides no error handling.
>
> No it does not, seems like it was more out of habit than necessity. Will
> remove the Result here in a v2!
>
>>
>>> +    where
>>> +        I: IntoIterator<Item = MigrationCandidate>,
>>> +    {
>>> +        let mut scored_migrations = candidates
>>> +            .into_iter()
>>> +            .map(|candidate| {
>>> +                let imbalance = self.node_imbalance_with_migration(&candidate);
>>> +
>>> +                ScoredMigration {
>>> +                    migration: candidate.into(),
>>> +                    imbalance,
>>> +                }
>>> +            })
>>> +            .collect::<BinaryHeap<_>>();
>>> +
>>> +        let mut best_alternatives = Vec::new();
>>
>> Since `limit` declares the capacity of `best_alternatives`, consider:
>>
>>     let mut best_alternatives = Vec::with_capacity(limit);
>
> Thanks, ACK!
>
>>
>>> +
>>> +        // BinaryHeap::into_iter_sorted() is still in nightly unfortunately
>>> +        while best_alternatives.len() < limit {
>>> +            match scored_migrations.pop() {
>>> +                Some(alternative) => best_alternatives.push(alternative),
>>> +                None => break,
>>> +            }
>>> +        }
>>> +
>>> +        Ok(best_alternatives)
>>> +    }





  reply	other threads:[~2026-03-11  8:21 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-17 14:13 [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer Daniel Kral
2026-02-17 14:13 ` [RFC proxmox 1/5] resource-scheduling: move score_nodes_to_start_service to scheduler crate Daniel Kral
2026-02-17 14:13 ` [RFC proxmox 2/5] resource-scheduling: introduce generic cluster usage implementation Daniel Kral
2026-03-09 13:38   ` Dominik Rusovac
2026-03-10 10:41     ` Daniel Kral
2026-02-17 14:13 ` [RFC proxmox 3/5] resource-scheduling: add dynamic node and service stats Daniel Kral
2026-02-17 14:13 ` [RFC proxmox 4/5] resource-scheduling: implement rebalancing migration selection Daniel Kral
2026-03-09 13:32   ` Dominik Rusovac
2026-03-10 10:40     ` Daniel Kral
2026-03-11  8:21       ` Dominik Rusovac [this message]
2026-02-17 14:13 ` [RFC proxmox 5/5] resource-scheduling: implement Add and Default for {Dynamic,Static}ServiceStats Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 1/6] pve-rs: resource scheduling: use generic cluster usage implementation Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 2/6] pve-rs: resource scheduling: create service_nodes hashset from array Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 3/6] pve-rs: resource scheduling: store service stats independently of node Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 4/6] pve-rs: resource scheduling: expose auto rebalancing methods Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 5/6] pve-rs: resource scheduling: move pve_static into resource_scheduling module Daniel Kral
2026-02-17 14:14 ` [RFC perl-rs 6/6] pve-rs: resource scheduling: implement pve_dynamic bindings Daniel Kral
2026-02-17 14:14 ` [RFC cluster 1/2] datacenter config: add dynamic load scheduler option Daniel Kral
2026-02-18 11:06   ` Maximiliano Sandoval
2026-02-17 14:14 ` [RFC cluster 2/2] datacenter config: add auto rebalancing options Daniel Kral
2026-02-18 11:15   ` Maximiliano Sandoval
2026-02-17 14:14 ` [RFC ha-manager 01/21] rename static node stats to be consistent with similar interfaces Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 02/21] resources: remove redundant load_config fallback for static config Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 03/21] remove redundant service_node and migration_target parameter Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 04/21] factor out common pve to ha resource type mapping Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 05/21] derive static service stats while filling the service stats repository Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 06/21] test: make static service usage explicit for all resources Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 07/21] make static service stats indexable by sid Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 08/21] move static service stats repository to PVE::HA::Usage::Static Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 09/21] usage: augment service stats with node and state information Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 10/21] include running non-HA resources in the scheduler's accounting Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 11/21] env, resources: add dynamic node and service stats abstraction Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 12/21] env: pve2: implement dynamic node and service stats Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 13/21] sim: hardware: pass correct types for static stats Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 14/21] sim: hardware: factor out static stats' default values Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 15/21] sim: hardware: rewrite set-static-stats Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 16/21] sim: hardware: add set-dynamic-stats for services Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 17/21] usage: add dynamic usage scheduler Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 18/21] manager: rename execute_migration to queue_resource_motion Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 19/21] manager: update_crs_scheduler_mode: factor out crs config Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 20/21] implement automatic rebalancing Daniel Kral
2026-02-17 14:14 ` [RFC ha-manager 21/21] test: add basic automatic rebalancing system test cases Daniel Kral
2026-02-17 14:14 ` [RFC manager 1/2] ui: dc/options: add dynamic load scheduler option Daniel Kral
2026-02-18 11:10   ` Maximiliano Sandoval
2026-02-17 14:14 ` [RFC manager 2/2] ui: dc/options: add auto rebalancing options Daniel Kral
2026-03-12 16:24 ` [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer DERUMIER, Alexandre
2026-03-13  9:35   ` Daniel Kral

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DGZT2BOJIASQ.GRM78RR23RT7@proxmox.com \
    --to=d.rusovac@proxmox.com \
    --cc=d.kral@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal