From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) by lore.proxmox.com (Postfix) with ESMTPS id 679841FF13A for ; Wed, 29 Apr 2026 09:18:45 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 43E6B20619; Wed, 29 Apr 2026 09:18:45 +0200 (CEST) Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Wed, 29 Apr 2026 09:18:08 +0200 Message-Id: To: "Daniel Kral" , Subject: Re: [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value to unit interval From: "Dominik Rusovac" X-Mailer: aerc 0.20.0 References: <20260427132031.220468-1-d.rusovac@proxmox.com> <20260427132031.220468-2-d.rusovac@proxmox.com> In-Reply-To: X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1777446991167 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.434 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [stackexchange.com,ipbeja.pt,scheduler.rs] Message-ID-Hash: KAZKPTXAQE2QNBEF3AYPGUTADHYG6T4C X-Message-ID-Hash: KAZKPTXAQE2QNBEF3AYPGUTADHYG6T4C X-MailFrom: d.rusovac@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue Apr 28, 2026 at 11:05 AM CEST, Daniel Kral wrote: > On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote: >> The currently used load imbalance value is given as the so-called >> coefficient of variation (CV), a value that may exceed 1. As such, the >> CV value alone lacks meaning. A CV value of 0.0 means no imbalance, but >> what does a value of, say, 1.7 mean? >> >> Relative to the number of nodes in a cluster, it is possible to >> determine the upper bound of the CV value [0][1]. By dividing the CV >> value by its upper bound, the load imbalance can be represented as a >> value that varies between 0 and 1. Expressing the CV as a percentage >> makes the concept of load imbalance easier to interpret. > > Nice, thanks for the work! > > Will test the changes over the week, but just from the better > readability / interpretability of the imbalance value this should make > it more user-friendly overall. thx! > >> >> [0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-db= e0-402f-9d2f-90c5bf6e418c/content >> [1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coe= fficient-of-variation-for-bounded-data-set > > and a good read overall, thanks! > > A note above Example 1 and the proposition 13 from the first paper [0] > is interesting here: > > All these properties refer to the case where a single sample is > considered, however, as noted by [16], Dodd=E2=80=99s corrected coeff= icient > of variation, CV corr is not suitable for comparative purpose, as > can be seen in the next example. > > Example 1. [...] > > [...] > > Proposition 13. Dodd=E2=80=99s corrected coefficient of variation, CV= corr, > is sample-size sensitive. > > AFAICT this should not be a problem for us to compare these values to > make decisions since we only compare the values for the same cluster > size configuration and therefore this shouldn't affect us badly. I agree > > > On another note, it should also be easier for users to set a imbalance > threshold value, which is (at least roughly) invariant to the size of > the cluster. If one or more nodes failed or are in maintenance node, > this dramatically changes the decisions that the load balancer can make > anyway, but I wonder whether how much difference it makes to the > sensitivity to trigger the load balancing system. > >> >> Signed-off-by: Dominik Rusovac >> --- >> proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++------- >> 1 file changed, 21 insertions(+), 12 deletions(-) >> [snip] >> fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&Node= Usage) -> f64) -> f64 { >> - let node_count =3D nodes.len(); >> + let node_count =3D nodes.len() as f64; > > even though this reduces the amount of 'as f64's below, the node count > is by its nature a positive integer so it should stay that way. > ack >> + >> + // imbalance is perfect for less than 2 nodes >> + if node_count < 2.0 { >> + return 0.0; >> + } > > this could replace the check `load_sum =3D=3D 0.0` below, which also make= s > sure that we don't ever divide by zero (and return a NaN as a result), > and move the assignments for node_loads and load_sum into the else > branch's code path. > > A comment could make it more explicit that this avoids dividing by zero. > true, will use early-return pattern, remove the else-branch and add a comme= nt in v2 >> + >> let node_loads =3D nodes.iter().map(to_load).collect::>(); >> =20 >> let load_sum =3D node_loads.iter().sum::(); >> @@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_= load: impl Fn(&NodeUsage) -> >> if load_sum =3D=3D 0.0 { >> 0.0 >> } else { >> - let load_mean =3D load_sum / node_count as f64; >> + let load_mean =3D load_sum / node_count; >> =20 >> let squared_diff_sum =3D node_loads >> .iter() >> .fold(0.0, |sum, node_load| sum + (node_load - load_mean).p= owi(2)); >> - let load_sd =3D (squared_diff_sum / node_count as f64).sqrt(); >> + let load_sd =3D (squared_diff_sum / node_count).sqrt(); >> + >> + let max_cv =3D (node_count - 1.0).sqrt(); >> + let cv =3D load_sd / load_mean; > > nit: just for aesthetics, could be reordered to cv and then max_cv I agree > > also to not loose the reference from the patch message with future > changes, it would be nice to add a comment here why the calculation for > max_cv is correct. > ack, will add some ref in v2 [snip]