From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 986981FF13B
	for <inbox@lore.proxmox.com>; Wed, 11 Mar 2026 09:21:33 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 539F6100D9;
	Wed, 11 Mar 2026 09:21:26 +0100 (CET)
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Date: Wed, 11 Mar 2026 09:21:17 +0100
Message-Id: <DGZT2BOJIASQ.GRM78RR23RT7@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [RFC proxmox 4/5] resource-scheduling: implement rebalancing
 migration selection
From: "Dominik Rusovac" <d.rusovac@proxmox.com>
X-Mailer: aerc 0.20.0
References: <20260217141437.584852-1-d.kral@proxmox.com>
 <20260217141437.584852-5-d.kral@proxmox.com>
 <DGYAFS4DDSPR.1NB1BSYR0HIAQ@proxmox.com>
 <DGZ1E90K0WCS.2Y9ZYYI3TD7R3@proxmox.com>
In-Reply-To: <DGZ1E90K0WCS.2Y9ZYYI3TD7R3@proxmox.com>
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1773217243510
X-SPAM-LEVEL: Spam detection results:  0
	AWL                    -0.773 Adjusted score from AWL reputation of From:
 address
	BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
	DMARC_MISSING             0.1 Missing DMARC policy
	KAM_DMARC_STATUS         0.01 Test Rule for DKIM or SPF Failure with Strict
 Alignment
	RCVD_IN_VALIDITY_CERTIFIED_BLOCKED  0.408 ADMINISTRATOR NOTICE: The query to
 Validity was blocked.  See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
	RCVD_IN_VALIDITY_RPBL_BLOCKED  0.819 ADMINISTRATOR NOTICE: The query to
 Validity was blocked.  See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
	RCVD_IN_VALIDITY_SAFE_BLOCKED  0.903 ADMINISTRATOR NOTICE: The query to
 Validity was blocked.  See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
	SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
	SPF_PASS               -0.001 SPF: sender matches SPF record
Message-ID-Hash: OSV6KMBN4Z2TEFKNJHBTSYTSQDRPXVVA
X-Message-ID-Hash: OSV6KMBN4Z2TEFKNJHBTSYTSQDRPXVVA
X-MailFrom: d.rusovac@proxmox.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; emergency; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pve-devel-owner@lists.proxmox.com>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Subscribe: <mailto:pve-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pve-devel-leave@lists.proxmox.com>

On Tue Mar 10, 2026 at 11:40 AM CET, Daniel Kral wrote:
> Thanks a lot for taking the time to review this! I've added some inline
> notes, though most of them are just ACKs.
>
> On Mon Mar 9, 2026 at 2:32 PM CET, Dominik Rusovac wrote:
>> Nice work! Here are my thoughts and some things I noticed.
>>
>> On Tue Feb 17, 2026 at 3:13 PM CET, Daniel Kral wrote:
>>> Assuming that a service will hold the same dynamic resource usage on a
>>> new node as on the previous node, score possible migrations, where:
>>>
>>> - the cluster node imbalance is minimal (bruteforce), or
>>>
>>> - the shifted root mean square and maximum resource usages of the cpu
>>>   and memory is minimal across the cluster nodes (TOPSIS).
>>>
>>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>>> ---
>>> score_best_balancing_migrations() and select_best_balancing_migration()
>>> are separate because there could be future improvements for the single
>>> select, but might be unnecessary and redundant (especially since we nee=
d
>>> to expose it at perlmod and PVE::HA::Usage::{Dynamic,Static} twice).
>>
>> Regarding future improvements for the "single select", providing an
>> `imbalance_threshold` to use for early termination may pay off
>> performance-wise, e.g., something along the lines of:
>
> The main idea to limit it in size is because of the
> serialization/deserialization to Perl values, but writing it out makes
> it clear that this should probably be the responsibility of the caller
> (i.e., the perlmod bindings).
>
> I might consider dropping the select_best_balancing_migration(...) here
> entirely as we can still add it later on and use
> score_best_balancing_migrations(..., limit=3D1) anywhere else now...

Sure. I just wanted to emphasize that `limit=3D1` is a special case where
we might want to terminate earlier, that is, whenever a migration
candidate is found that meets our expectations, e.g., "the imbalance
ought to be at most 0.7" (threshold is 0.7) or "the current imbalance
ought to improve by 0.3" (threshold is current imbalance - 0.3).
Providing such a threshold would of course also be the responsibility of
the caller.

>
>>
>>     pub fn select_best_balancing_migration<I>(
>>         &self,
>>         candidates: I,
>>         imbalance_threshold: Option<f64>
>>     ) -> Option<ScoredMigration>
>>     where
>>         I: IntoIterator<Item =3D MigrationCandidate>,
>>     {
>>         let threshold =3D imbalance_threshold.unwrap_or(0.0);
>>
>>         let mut scored_migrations =3D BinaryHeap::new();
>>         for candidate in candidates {
>>                 let imbalance =3D self.node_imbalance_with_migration(&ca=
ndidate);
>>
>>                 let migration =3D ScoredMigration {
>>                     migration: candidate.into(),
>>                     imbalance,
>>                 };
>>
>>                 if imbalance <=3D threshold {
>>                     return Some(migration);
>>                 }
>
> Hm, even though this could be nice, I'm not so sure about it as setting
> a threshold externally might not be that easy and this will not consider

These considerations might be overkill, but atm I cannot foresee how
much of a bottleneck bruteforcing our way through all candidates is, in
real-world scenarios. The idea behind setting a threshold is to mitigate
the risk of inefficient bruteforce search by trading off the quality of
solutions for efficiency.

In general, such a threshold may simply serve as a configurable
parameter that declares the amount of imbalance a user is willing to
accept for the sake of quicker decisions. The threshold value could, for
example, be 0.0, unless otherwise specified.

> any later possibly better migration candidates?
>

Indeed, terminating earlier on a threshold > 0.0 may overlook better
migration candidates, but this is part of the trade-off. Relating to
this, in my understanding, no candidate can score an imbalance < 0.0.
So, at least we can always terminate whenever a candidate that scores an
imbalance of 0.0 is found.

>>
>>                 scored_migrations.push(migration);
>>             }
>>
>>         scored_migrations.pop()
>>     }
>>>
>>
>> [snip]=20
>>
>>> +fn calculate_node_loads(nodes: &[NodeStats]) -> Vec<f64> {
>>> +    nodes.iter().map(|stats| stats.load()).collect()
>>> +}
>>> +
>>> +/// Returns the load imbalance among the nodes.
>>> +///
>>> +/// The load balance is measured as the statistical dispersion of the =
individual node loads.
>>> +///
>>> +/// The current implementation uses the dimensionless coefficient of v=
ariation, which expresses the
>>> +/// standard deviation in relation to the average mean of the node loa=
ds. Additionally, the
>>> +/// coefficient of variation is not robust, which is
>>
>> Parts of the docs are missing, it seems.
>
> Right, will fix that in a v2!
>
>>
>>> +fn calculate_node_imbalance(nodes: &[NodeStats]) -> f64 {
>>> +    let node_count =3D nodes.len();
>>> +    let node_loads =3D calculate_node_loads(nodes);
>>> +
>>> +    let load_sum =3D node_loads
>>> +        .iter()
>>> +        .fold(0.0, |sum, node_load| sum + node_load);
>>
>> To increase readability, consider:
>>    =20
>>     let load_sum =3D node_loads.iter().sum::<f64>();
>
> Will do!
>
> I still want to document what Dominik pointed out to me off-list: Even
> though that Iterator<Item =3D f64>::sum() will return -0.0 for iterators
> that yield nothing, but the following `load_sum =3D=3D 0.0` test will sti=
ll
> hold true as -0.0 =3D=3D 0.0 in IEEE754 [0].
>
> [0] https://en.wikipedia.org/wiki/IEEE_754#Comparison_predicates
>

Forgot to elaborate on this here, thx for mentioning it.

>>
>>> +
>>> +    // load_sum is guaranteed to be 0.0 for empty nodes
>>> +    if load_sum =3D=3D 0.0 {
>>> +        0.0
>>> +    } else {
>>> +        let load_mean =3D load_sum / node_count as f64;
>>> +
>>> +        let squared_diff_sum =3D node_loads
>>> +            .iter()
>>> +            .fold(0.0, |sum, node_load| sum + (node_load - load_mean).=
powi(2));
>>> +        let load_sd =3D (squared_diff_sum / node_count as f64).sqrt();
>>> +
>>> +        load_sd / load_mean
>>> +    }
>>>  }
>>
>> [0] Considering where this function is used, expecting a
>> designated closure that is used to map `NodeUsage` onto `f64` seems=20
>> to be more convenient, e.g.:
>>
>>     fn calculate_node_imbalance(
>>         nodes: &[NodeUsage],
>> 	to_load: impl FnMut(&NodeUsage) -> f64,
>>     ) -> f64 {
>>         let node_count =3D nodes.len();
>> 	let node_loads =3D nodes.iter().map(to_load).collect::<Vec<_>>();
>>
>> 	// ...
>>     }
>>
>
> Great idea! I'll change it to this in the v2!
>
>>> =20
>>
>> [snip]
>>
>>> +    fn node_stats(&self) -> Vec<NodeStats> {
>>> +        self.nodes.iter().map(|node| node.stats).collect()
>>> +    }
>>> +
>>> +    /// Returns the individual node loads.
>>> +    pub fn node_loads(&self) -> Vec<(String, f64)> {
>>> +        self.nodes
>>> +            .iter()
>>> +            .map(|node| (node.name.to_string(), node.stats.load()))
>>> +            .collect()
>>> +    }
>>> +
>>> +    /// Returns the load imbalance among the nodes.
>>> +    ///
>>> +    /// See [`calculate_node_imbalance`] for more information.
>>> +    pub fn node_imbalance(&self) -> f64 {
>>> +        let node_stats =3D self.node_stats();
>>> +
>>> +        calculate_node_imbalance(&node_stats)
>>> +    }
>>
>> Regarding [0], passing a closure compresses two iterations into one, e.g=
.:
>>
>>     pub fn node_imbalance(&self) -> f64 {
>>         calculate_node_imbalance(&self.nodes, |node| node.stats.load())
>>     }
>>
>
> ACK
>
>>> +
>>> +    /// Returns the load imbalance among the nodes as if a specific se=
rvice was moved.
>>> +    ///
>>> +    /// See [`calculate_node_imbalance`] for more information.
>>> +    pub fn node_imbalance_with_migration(&self, migration: &MigrationC=
andidate) -> f64 {
>>> +        let mut new_node_stats =3D Vec::with_capacity(self.nodes.len()=
);
>>> +
>>> +        self.nodes.iter().for_each(|node| {
>>> +            let mut new_stats =3D node.stats;
>>> +
>>> +            if node.name =3D=3D migration.source_node {
>>> +                new_stats.remove_running_service(&migration.stats);
>>> +            } else if node.name =3D=3D migration.target_node {
>>> +                new_stats.add_running_service(&migration.stats);
>>> +            }
>>> +
>>> +            new_node_stats.push(new_stats);
>>> +        });
>>> +
>>> +        calculate_node_imbalance(&new_node_stats)
>>> +    }
>>
>> Same here regarding [0], e.g.:
>>
>>     pub fn node_imbalance_with_migration(&self, migration: &MigrationCan=
didate) -> f64 {
>>         calculate_node_imbalance_some(&self.nodes, |node| {
>>             let mut new_stats =3D node.stats;
>>
>>             if node.name =3D=3D migration.source_node {
>>                 new_stats.remove_running_service(&migration.stats);
>>             } else if node.name =3D=3D migration.target_node {
>>                 new_stats.add_running_service(&migration.stats);
>>             }
>>
>>             new_stats.load()
>>         })
>>     }
>>
>
> ACK, nice!
>
>>> +
>>> +    /// Score the service motions by the best node imbalance improveme=
nt with exhaustive search.
>>> +    pub fn score_best_balancing_migrations<I>(
>>> +        &self,
>>> +        candidates: I,
>>> +        limit: usize,
>>> +    ) -> Result<Vec<ScoredMigration>, Error>
>>
>> Does this really have to return a `Result`? It uses no try operator and
>> provides no error handling.
>
> No it does not, seems like it was more out of habit than necessity. Will
> remove the Result here in a v2!
>
>>
>>> +    where
>>> +        I: IntoIterator<Item =3D MigrationCandidate>,
>>> +    {
>>> +        let mut scored_migrations =3D candidates
>>> +            .into_iter()
>>> +            .map(|candidate| {
>>> +                let imbalance =3D self.node_imbalance_with_migration(&=
candidate);
>>> +
>>> +                ScoredMigration {
>>> +                    migration: candidate.into(),
>>> +                    imbalance,
>>> +                }
>>> +            })
>>> +            .collect::<BinaryHeap<_>>();
>>> +
>>> +        let mut best_alternatives =3D Vec::new();
>>
>> Since `limit` declares the capacity of `best_alternatives`, consider:
>>
>>     let mut best_alternatives =3D Vec::with_capacity(limit);
>
> Thanks, ACK!
>
>>
>>> +
>>> +        // BinaryHeap::into_iter_sorted() is still in nightly unfortun=
ately
>>> +        while best_alternatives.len() < limit {
>>> +            match scored_migrations.pop() {
>>> +                Some(alternative) =3D> best_alternatives.push(alternat=
ive),
>>> +                None =3D> break,
>>> +            }
>>> +        }
>>> +
>>> +        Ok(best_alternatives)
>>> +    }