From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 679841FF13A
	for <inbox@lore.proxmox.com>; Wed, 29 Apr 2026 09:18:45 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 43E6B20619;
	Wed, 29 Apr 2026 09:18:45 +0200 (CEST)
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Date: Wed, 29 Apr 2026 09:18:08 +0200
Message-Id: <DI5GENTUC9C1.37Z3OK3D7QSE0@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value
 to unit interval
From: "Dominik Rusovac" <d.rusovac@proxmox.com>
X-Mailer: aerc 0.20.0
References: <20260427132031.220468-1-d.rusovac@proxmox.com>
 <20260427132031.220468-2-d.rusovac@proxmox.com>
 <DI4O2EE0OX1J.1B95XY2KTZNQ1@proxmox.com>
In-Reply-To: <DI4O2EE0OX1J.1B95XY2KTZNQ1@proxmox.com>
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1777446991167
X-SPAM-LEVEL: Spam detection results:  0
	AWL                     0.434 Adjusted score from AWL reputation of From:
 address
	BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
	DMARC_MISSING             0.1 Missing DMARC policy
	KAM_DMARC_STATUS         0.01 Test Rule for DKIM or SPF Failure with Strict
 Alignment
	SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
	SPF_PASS               -0.001 SPF: sender matches SPF record
	URIBL_BLOCKED           0.001 ADMINISTRATOR NOTICE: The query to URIBL was
 blocked.  See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
 for more information. [stackexchange.com,ipbeja.pt,scheduler.rs]
Message-ID-Hash: KAZKPTXAQE2QNBEF3AYPGUTADHYG6T4C
X-Message-ID-Hash: KAZKPTXAQE2QNBEF3AYPGUTADHYG6T4C
X-MailFrom: d.rusovac@proxmox.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; emergency; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pve-devel-owner@lists.proxmox.com>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Subscribe: <mailto:pve-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pve-devel-leave@lists.proxmox.com>

On Tue Apr 28, 2026 at 11:05 AM CEST, Daniel Kral wrote:
> On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
>> The currently used load imbalance value is given as the so-called
>> coefficient of variation (CV), a value that may exceed 1. As such, the
>> CV value alone lacks meaning. A CV value of 0.0 means no imbalance, but
>> what does a value of, say, 1.7 mean?
>>
>> Relative to the number of nodes in a cluster, it is possible to
>> determine the upper bound of the CV value [0][1]. By dividing the CV
>> value by its upper bound, the load imbalance can be represented as a
>> value that varies between 0 and 1. Expressing the CV as a percentage
>> makes the concept of load imbalance easier to interpret.
>
> Nice, thanks for the work!
>
> Will test the changes over the week, but just from the better
> readability / interpretability of the imbalance value this should make
> it more user-friendly overall.

thx!

>
>>
>> [0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-db=
e0-402f-9d2f-90c5bf6e418c/content
>> [1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coe=
fficient-of-variation-for-bounded-data-set
>
> and a good read overall, thanks!
>
> A note above Example 1 and the proposition 13 from the first paper [0]
> is interesting here:
>
>     All these properties refer to the case where a single sample is
>     considered, however, as noted by [16], Dodd=E2=80=99s corrected coeff=
icient
>     of variation, CV corr is not suitable for comparative purpose, as
>     can be seen in the next example.
>
>     Example 1. [...]
>
>     [...]
>
>     Proposition 13. Dodd=E2=80=99s corrected coefficient of variation, CV=
 corr,
>     is sample-size sensitive.
>
> AFAICT this should not be a problem for us to compare these values to
> make decisions since we only compare the values for the same cluster
> size configuration and therefore this shouldn't affect us badly.

I agree

>
>
> On another note, it should also be easier for users to set a imbalance
> threshold value, which is (at least roughly) invariant to the size of
> the cluster. If one or more nodes failed or are in maintenance node,
> this dramatically changes the decisions that the load balancer can make
> anyway, but I wonder whether how much difference it makes to the
> sensitivity to trigger the load balancing system.
>
>>
>> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
>> ---
>>  proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++-------
>>  1 file changed, 21 insertions(+), 12 deletions(-)
>>

[snip]

>>  fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&Node=
Usage) -> f64) -> f64 {
>> -    let node_count =3D nodes.len();
>> +    let node_count =3D nodes.len() as f64;
>
> even though this reduces the amount of 'as f64's below, the node count
> is by its nature a positive integer so it should stay that way.
>

ack

>> +
>> +    // imbalance is perfect for less than 2 nodes
>> +    if node_count < 2.0 {
>> +        return 0.0;
>> +    }
>
> this could replace the check `load_sum =3D=3D 0.0` below, which also make=
s
> sure that we don't ever divide by zero (and return a NaN as a result),
> and move the assignments for node_loads and load_sum into the else
> branch's code path.
>
> A comment could make it more explicit that this avoids dividing by zero.
>

true, will use early-return pattern, remove the else-branch and add a comme=
nt in v2

>> +
>>      let node_loads =3D nodes.iter().map(to_load).collect::<Vec<_>>();
>> =20
>>      let load_sum =3D node_loads.iter().sum::<f64>();
>> @@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_=
load: impl Fn(&NodeUsage) ->
>>      if load_sum =3D=3D 0.0 {
>>          0.0
>>      } else {
>> -        let load_mean =3D load_sum / node_count as f64;
>> +        let load_mean =3D load_sum / node_count;
>> =20
>>          let squared_diff_sum =3D node_loads
>>              .iter()
>>              .fold(0.0, |sum, node_load| sum + (node_load - load_mean).p=
owi(2));
>> -        let load_sd =3D (squared_diff_sum / node_count as f64).sqrt();
>> +        let load_sd =3D (squared_diff_sum / node_count).sqrt();
>> +
>> +        let max_cv =3D (node_count - 1.0).sqrt();
>> +        let cv =3D load_sd / load_mean;
>
> nit: just for aesthetics, could be reordered to cv and then max_cv

I agree

>
> also to not loose the reference from the patch message with future
> changes, it would be nice to add a comment here why the calculation for
> max_cv is correct.
>

ack, will add some ref in v2

[snip]