* [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval
@ 2026-04-27 13:20 Dominik Rusovac
2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
` (7 more replies)
0 siblings, 8 replies; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
# TL;DR
Clamp the load imbalance to a value between 0 and 1, and display it as a
percentage in the HA Status panel of the PVE UI.
# Details
The currently used load imbalance value is given as the so-called coefficient of
variation (CV), a value that may exceed 1. As such, the CV value alone lacks
meaning. A CV value of 0.0 means no imbalance, but what does a value of, say,
1.7 mean?
For a given number of nodes n, the CV has a known upper bound of sqrt(n - 1)
[0][1]. Dividing the CV by this upper bound yields a load imbalance value
between 0 and 1, and expressing that value as a percentage makes the concept of
load imbalance easier to interpret.
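The normalization described above can be sketched as a standalone Rust function (an illustrative sketch only, not the series' actual scheduler code; the function name is made up for this example):

```rust
/// Normalized load imbalance: the coefficient of variation (CV) of the node
/// loads divided by its upper bound sqrt(n - 1), yielding a value in [0, 1].
fn normalized_imbalance(loads: &[f64]) -> f64 {
    let n = loads.len() as f64;
    if n < 2.0 {
        return 0.0; // a single node (or none) is trivially balanced
    }
    let sum: f64 = loads.iter().sum();
    if sum == 0.0 {
        return 0.0; // all nodes idle: no imbalance
    }
    let mean = sum / n;
    let variance = loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
    let cv = variance.sqrt() / mean;
    cv / (n - 1.0).sqrt() // clamp the CV to the unit interval
}

fn main() {
    // All load on one of three nodes: the CV hits its bound, so the
    // normalized imbalance is 1.0.
    println!("{}", normalized_imbalance(&[6.0, 0.0, 0.0]));
    // Evenly spread load: no imbalance at all.
    println!("{}", normalized_imbalance(&[2.0, 2.0, 2.0]));
}
```

With three nodes and the whole load on one of them, the plain CV is sqrt(2) (the 1.41 seen in the old test logs), and the normalized value is 1.00, matching the adjusted logs below.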
# Summary of Changes
This series:
- represents load imbalance as a value between 0 and 1;
- adds a maximum value of 1.0 for load scheduler options; and
- integrates the load imbalance value into the HA status endpoint, to provide
feedback on the prevailing load imbalance in the PVE UI.
# Refs
[0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
[1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set
proxmox:
Dominik Rusovac (2):
resource-scheduling: clamp imbalance value to unit interval
resource-scheduling: re-adjust hardcoded imbalance values
proxmox-resource-scheduling/src/scheduler.rs | 33 ++++++++++++-------
.../tests/scheduler.rs | 8 ++---
2 files changed, 25 insertions(+), 16 deletions(-)
pve-manager:
Dominik Rusovac (1):
ui: form/CRSOptions: add maximum for threshold
www/manager6/form/CRSOptions.js | 1 +
1 file changed, 1 insertion(+)
pve-ha-manager:
Dominik Rusovac (3):
test: re-adjust logged imbalance values
manager: add load imbalance to status
api: status: add load imbalance to status
src/PVE/API2/HA/Status.pm | 4 +-
src/PVE/HA/Manager.pm | 1 +
.../log.expect | 4 +-
.../log.expect | 38 +++++++++----------
.../log.expect | 4 +-
.../log.expect | 29 +++++---------
.../log.expect | 2 +-
.../log.expect | 2 +-
.../log.expect | 4 +-
.../log.expect | 4 +-
.../log.expect | 4 +-
.../log.expect | 22 +----------
12 files changed, 47 insertions(+), 71 deletions(-)
pve-cluster:
Dominik Rusovac (1):
datacenter config: add maxima for load scheduler options
src/PVE/DataCenterConfig.pm | 2 ++
1 file changed, 2 insertions(+)
Summary over all repositories:
16 files changed, 75 insertions(+), 87 deletions(-)
--
Generated by murpp 0.11.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value to unit interval
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 9:05 ` Daniel Kral
2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
` (6 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
The currently used load imbalance value is given as the so-called
coefficient of variation (CV), a value that may exceed 1. As such, the
CV value alone lacks meaning. A CV value of 0.0 means no imbalance, but
what does a value of, say, 1.7 mean?
For a given number of nodes n, the CV has a known upper bound of
sqrt(n - 1) [0][1]. Dividing the CV by this upper bound yields a load
imbalance value between 0 and 1, and expressing that value as a
percentage makes the concept of load imbalance easier to interpret.
[0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
[1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set
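As a quick numerical sanity check of that bound (a standalone sketch, not part of this patch): concentrating the entire cluster load on a single node drives the CV to exactly sqrt(n - 1), so the normalized value reaches 1.

```rust
/// Plain (unnormalized) coefficient of variation of a load vector.
fn coefficient_of_variation(loads: &[f64]) -> f64 {
    let n = loads.len() as f64;
    let mean = loads.iter().sum::<f64>() / n;
    let variance = loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn main() {
    // All load on one node maximizes dispersion: CV == sqrt(n - 1).
    for n in 2..=8 {
        let mut loads = vec![0.0; n];
        loads[0] = 1.0; // whole cluster load on a single node
        let bound = ((n - 1) as f64).sqrt();
        assert!((coefficient_of_variation(&loads) - bound).abs() < 1e-12);
    }
    println!("one-hot load vectors hit the sqrt(n - 1) bound for n = 2..=8");
}
```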
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++-------
1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/proxmox-resource-scheduling/src/scheduler.rs b/proxmox-resource-scheduling/src/scheduler.rs
index 49d16f9f..4eacbff9 100644
--- a/proxmox-resource-scheduling/src/scheduler.rs
+++ b/proxmox-resource-scheduling/src/scheduler.rs
@@ -17,17 +17,23 @@ pub struct NodeUsage {
pub stats: NodeStats,
}
-/// Returns the load imbalance among the nodes.
+/// Returns the load imbalance among the nodes, which is a value between 0 and 1 that describes the
+/// statistical dispersion of the individual node loads around the mean node load. The lower the
+/// value, the better.
///
-/// The load balance is measured as the statistical dispersion of the individual node loads.
-///
-/// The current implementation uses the dimensionless coefficient of variation, which expresses the
-/// standard deviation in relation to the average mean of the node loads.
-///
-/// The coefficient of variation is not robust, which is a desired property here, because outliers
-/// should be detected as much as possible.
+/// In more detail, the current implementation computes the so-called coefficient of variation (CV),
+/// which is the ratio of the standard deviation to the mean of the given node loads. The lower
+/// bound of the CV is reached if all node loads are equal. The upper bound is reached if all nodes
+/// except one are idle. To present the CV as a value between 0 and 1, it is divided by its upper
+/// bound for the given number of nodes.
fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) -> f64) -> f64 {
- let node_count = nodes.len();
+ let node_count = nodes.len() as f64;
+
+ // fewer than 2 nodes cannot be imbalanced
+ if node_count < 2.0 {
+ return 0.0;
+ }
+
let node_loads = nodes.iter().map(to_load).collect::<Vec<_>>();
let load_sum = node_loads.iter().sum::<f64>();
@@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) ->
if load_sum == 0.0 {
0.0
} else {
- let load_mean = load_sum / node_count as f64;
+ let load_mean = load_sum / node_count;
let squared_diff_sum = node_loads
.iter()
.fold(0.0, |sum, node_load| sum + (node_load - load_mean).powi(2));
- let load_sd = (squared_diff_sum / node_count as f64).sqrt();
+ let load_sd = (squared_diff_sum / node_count).sqrt();
+
+ let max_cv = (node_count - 1.0).sqrt();
+ let cv = load_sd / load_mean;
- load_sd / load_mean
+ cv / max_cv
}
}
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 8:53 ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
` (5 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
proxmox-resource-scheduling/tests/scheduler.rs | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/proxmox-resource-scheduling/tests/scheduler.rs b/proxmox-resource-scheduling/tests/scheduler.rs
index be90e4f9..21dbe451 100644
--- a/proxmox-resource-scheduling/tests/scheduler.rs
+++ b/proxmox-resource-scheduling/tests/scheduler.rs
@@ -172,7 +172,7 @@ fn test_score_best_balancing_migration_candidates_with_no_candidates() {
fn test_score_best_balancing_migration_candidates_in_homogeneous_cluster() {
let scheduler = new_homogeneous_cluster_scheduler();
- assert_imbalance(scheduler.node_imbalance(), 0.4893954724628247);
+ assert_imbalance(scheduler.node_imbalance(), 0.3460548572604576);
let (candidates, migration1, migration2) = new_simple_migration_candidates();
@@ -186,7 +186,7 @@ fn test_score_best_balancing_migration_candidates_in_homogeneous_cluster() {
fn test_score_best_balancing_migration_candidates_in_heterogeneous_cluster() {
let scheduler = new_heterogeneous_cluster_scheduler();
- assert_imbalance(scheduler.node_imbalance(), 0.33026013056867354);
+ assert_imbalance(scheduler.node_imbalance(), 0.23352917788066363);
let (candidates, migration1, migration2) = new_simple_migration_candidates();
@@ -225,7 +225,7 @@ fn test_score_best_balancing_migration_candidates_topsis_in_homogeneous_cluster(
) -> Result<(), Error> {
let scheduler = new_homogeneous_cluster_scheduler();
- assert_imbalance(scheduler.node_imbalance(), 0.4893954724628247);
+ assert_imbalance(scheduler.node_imbalance(), 0.3460548572604576);
let (candidates, migration1, migration2) = new_simple_migration_candidates();
@@ -242,7 +242,7 @@ fn test_score_best_balancing_migration_candidates_topsis_in_heterogeneous_cluste
) -> Result<(), Error> {
let scheduler = new_heterogeneous_cluster_scheduler();
- assert_imbalance(scheduler.node_imbalance(), 0.33026013056867354);
+ assert_imbalance(scheduler.node_imbalance(), 0.23352917788066363);
let (candidates, migration1, migration2) = new_simple_migration_candidates();
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 8:52 ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
www/manager6/form/CRSOptions.js | 1 +
1 file changed, 1 insertion(+)
diff --git a/www/manager6/form/CRSOptions.js b/www/manager6/form/CRSOptions.js
index b5476bd5..985eb8cf 100644
--- a/www/manager6/form/CRSOptions.js
+++ b/www/manager6/form/CRSOptions.js
@@ -66,6 +66,7 @@ Ext.define('PVE.form.CRSOptions', {
fieldLabel: gettext('Imbalance Threshold'),
emptyText: '0.3',
minValue: 0.0,
+ maxValue: 1.0,
step: 0.01,
bind: {
disabled: '{!enableAutoRebalance.checked}',
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
` (2 preceding siblings ...)
2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 8:52 ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
` (3 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
.../log.expect | 4 +-
.../log.expect | 38 +++++++++----------
.../log.expect | 4 +-
.../log.expect | 29 +++++---------
.../log.expect | 2 +-
.../log.expect | 2 +-
.../log.expect | 4 +-
.../log.expect | 4 +-
.../log.expect | 4 +-
.../log.expect | 22 +----------
10 files changed, 43 insertions(+), 70 deletions(-)
diff --git a/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect b/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
index 3d79026..83d4e60 100644
--- a/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
@@ -34,7 +34,7 @@ info 21 node1/lrm: starting service vm:104
info 21 node1/lrm: service status vm:104 started
info 22 node2/crm: status change wait_for_quorum => slave
info 24 node3/crm: status change wait_for_quorum => slave
-info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
info 80 node1/crm: got crm command: migrate vm:101 node2
info 80 node1/crm: migrate service 'vm:101' to node 'node2'
info 80 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
@@ -45,7 +45,7 @@ info 83 node2/lrm: status change wait_for_agent_lock => active
info 100 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
info 103 node2/lrm: starting service vm:101
info 103 node2/lrm: service status vm:101 started
-info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
info 160 node1/crm: got crm command: migrate vm:102 node3
info 160 node1/crm: migrate service 'vm:102' to node 'node3'
info 160 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
diff --git a/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect b/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
index c9fc29e..c539122 100644
--- a/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
@@ -53,7 +53,7 @@ info 25 node3/lrm: service status vm:107 started
info 120 cmdlist: execute service vm:105 set-dynamic-stats cpu 7.8 mem 7912
info 120 cmdlist: execute service vm:106 set-dynamic-stats cpu 5.7 mem 8192
info 120 cmdlist: execute service vm:107 set-dynamic-stats cpu 6.0 mem 8011
-info 160 node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.85 to 0.42)
+info 160 node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.60 to 0.30)
info 160 node1/crm: got crm command: migrate vm:105 node2
info 160 node1/crm: migrate service 'vm:105' to node 'node2'
info 160 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node3, target = node2)
@@ -68,22 +68,22 @@ info 220 cmdlist: execute service vm:104 set-dynamic-stats cpu 6.7 mem 8
info 220 cmdlist: execute service vm:105 set-dynamic-stats cpu 1.8 mem 1201
info 220 cmdlist: execute service vm:106 set-dynamic-stats cpu 2.1 mem 1211
info 220 cmdlist: execute service vm:107 set-dynamic-stats cpu 0.9 mem 1191
-info 240 node1/crm: auto rebalance - migrate vm:103 to node3 (expected change for imbalance from 0.81 to 0.43)
-info 240 node1/crm: got crm command: migrate vm:103 node3
-info 240 node1/crm: migrate service 'vm:103' to node 'node3'
-info 240 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node2, target = node3)
-info 243 node2/lrm: service vm:103 - start migrate to node 'node3'
-info 243 node2/lrm: service vm:103 - end migrate to node 'node3'
-info 260 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node3)
-info 265 node3/lrm: starting service vm:103
-info 265 node3/lrm: service status vm:103 started
-info 320 node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.43 to 0.24)
-info 320 node1/crm: got crm command: migrate vm:105 node1
-info 320 node1/crm: migrate service 'vm:105' to node 'node1'
-info 320 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node2, target = node1)
-info 323 node2/lrm: service vm:105 - start migrate to node 'node1'
-info 323 node2/lrm: service vm:105 - end migrate to node 'node1'
-info 340 node1/crm: service 'vm:105': state changed from 'migrate' to 'started' (node = node1)
-info 341 node1/lrm: starting service vm:105
-info 341 node1/lrm: service status vm:105 started
+info 260 node1/crm: auto rebalance - migrate vm:103 to node3 (expected change for imbalance from 0.57 to 0.30)
+info 260 node1/crm: got crm command: migrate vm:103 node3
+info 260 node1/crm: migrate service 'vm:103' to node 'node3'
+info 260 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node2, target = node3)
+info 263 node2/lrm: service vm:103 - start migrate to node 'node3'
+info 263 node2/lrm: service vm:103 - end migrate to node 'node3'
+info 280 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node3)
+info 285 node3/lrm: starting service vm:103
+info 285 node3/lrm: service status vm:103 started
+info 340 node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.30 to 0.17)
+info 340 node1/crm: got crm command: migrate vm:105 node1
+info 340 node1/crm: migrate service 'vm:105' to node 'node1'
+info 340 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node2, target = node1)
+info 343 node2/lrm: service vm:105 - start migrate to node 'node1'
+info 343 node2/lrm: service vm:105 - end migrate to node 'node1'
+info 360 node1/crm: service 'vm:105': state changed from 'migrate' to 'started' (node = node1)
+info 361 node1/lrm: starting service vm:105
+info 361 node1/lrm: service status vm:105 started
info 820 hardware: exit simulation - done
diff --git a/src/test/test-crs-dynamic-auto-rebalance2/log.expect b/src/test/test-crs-dynamic-auto-rebalance2/log.expect
index 3d79026..83d4e60 100644
--- a/src/test/test-crs-dynamic-auto-rebalance2/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance2/log.expect
@@ -34,7 +34,7 @@ info 21 node1/lrm: starting service vm:104
info 21 node1/lrm: service status vm:104 started
info 22 node2/crm: status change wait_for_quorum => slave
info 24 node3/crm: status change wait_for_quorum => slave
-info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
info 80 node1/crm: got crm command: migrate vm:101 node2
info 80 node1/crm: migrate service 'vm:101' to node 'node2'
info 80 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
@@ -45,7 +45,7 @@ info 83 node2/lrm: status change wait_for_agent_lock => active
info 100 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
info 103 node2/lrm: starting service vm:101
info 103 node2/lrm: service status vm:101 started
-info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
info 160 node1/crm: got crm command: migrate vm:102 node3
info 160 node1/crm: migrate service 'vm:102' to node 'node3'
info 160 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
diff --git a/src/test/test-crs-dynamic-auto-rebalance3/log.expect b/src/test/test-crs-dynamic-auto-rebalance3/log.expect
index 275f7ae..6f8c1ee 100644
--- a/src/test/test-crs-dynamic-auto-rebalance3/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance3/log.expect
@@ -53,7 +53,7 @@ info 25 node3/lrm: service status vm:107 started
info 120 cmdlist: execute service vm:105 set-dynamic-stats cpu 7.8 mem 7912
info 120 cmdlist: execute service vm:106 set-dynamic-stats cpu 5.7 mem 8192
info 120 cmdlist: execute service vm:107 set-dynamic-stats cpu 6.0 mem 8011
-info 160 node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.85 to 0.42)
+info 160 node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.60 to 0.30)
info 160 node1/crm: got crm command: migrate vm:105 node2
info 160 node1/crm: migrate service 'vm:105' to node 'node2'
info 160 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node3, target = node2)
@@ -68,22 +68,13 @@ info 220 cmdlist: execute service vm:104 set-dynamic-stats cpu 6.7 mem 8
info 220 cmdlist: execute service vm:105 set-dynamic-stats cpu 1.8 mem 1201
info 220 cmdlist: execute service vm:106 set-dynamic-stats cpu 2.1 mem 1211
info 220 cmdlist: execute service vm:107 set-dynamic-stats cpu 0.9 mem 1191
-info 240 node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.81 to 0.40)
-info 240 node1/crm: got crm command: migrate vm:103 node1
-info 240 node1/crm: migrate service 'vm:103' to node 'node1'
-info 240 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node2, target = node1)
-info 243 node2/lrm: service vm:103 - start migrate to node 'node1'
-info 243 node2/lrm: service vm:103 - end migrate to node 'node1'
-info 260 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node1)
-info 261 node1/lrm: starting service vm:103
-info 261 node1/lrm: service status vm:103 started
-info 320 node1/crm: auto rebalance - migrate vm:105 to node3 (expected change for imbalance from 0.40 to 0.21)
-info 320 node1/crm: got crm command: migrate vm:105 node3
-info 320 node1/crm: migrate service 'vm:105' to node 'node3'
-info 320 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node2, target = node3)
-info 323 node2/lrm: service vm:105 - start migrate to node 'node3'
-info 323 node2/lrm: service vm:105 - end migrate to node 'node3'
-info 340 node1/crm: service 'vm:105': state changed from 'migrate' to 'started' (node = node3)
-info 345 node3/lrm: starting service vm:105
-info 345 node3/lrm: service status vm:105 started
+info 260 node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.57 to 0.28)
+info 260 node1/crm: got crm command: migrate vm:103 node1
+info 260 node1/crm: migrate service 'vm:103' to node 'node1'
+info 260 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node2, target = node1)
+info 263 node2/lrm: service vm:103 - start migrate to node 'node1'
+info 263 node2/lrm: service vm:103 - end migrate to node 'node1'
+info 280 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node1)
+info 281 node1/lrm: starting service vm:103
+info 281 node1/lrm: service status vm:103 started
info 820 hardware: exit simulation - done
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
index c926799..30d9721 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
@@ -35,7 +35,7 @@ info 120 cmdlist: execute service vm:104 set-static-stats maxcpu 8.0 max
info 120 cmdlist: execute service vm:104 set-dynamic-stats cpu 4.0 mem 4096
info 120 node1/crm: adding new service 'vm:104' on node 'node1'
info 120 node1/crm: service 'vm:104': state changed from 'request_start' to 'started' (node = node1)
-info 140 node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 1.41 to 0.98)
+info 140 node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 1.00 to 0.70)
info 140 node1/crm: got crm command: migrate vm:104 node2
info 140 node1/crm: migrate service 'vm:104' to node 'node2'
info 140 node1/crm: service 'vm:104': state changed from 'started' to 'migrate' (node = node1, target = node2)
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
index 26be942..d9189c9 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
@@ -31,7 +31,7 @@ info 120 cmdlist: execute service vm:103 set-static-stats maxcpu 8.0 max
info 120 cmdlist: execute service vm:103 set-dynamic-stats cpu 4.0 mem 4096
info 120 node1/crm: adding new service 'vm:103' on node 'node1'
info 120 node1/crm: service 'vm:103': state changed from 'request_start' to 'started' (node = node1)
-info 140 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.86)
+info 140 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.61)
info 140 node1/crm: got crm command: migrate vm:101 node2
info 140 node1/crm: crm command 'migrate vm:101 node2' - migrate service 'vm:102' to node 'node2' (service 'vm:102' in positive affinity with service 'vm:101')
info 140 node1/crm: migrate service 'vm:101' to node 'node2'
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
index 35282c7..82b0b13 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
@@ -28,7 +28,7 @@ info 24 node3/crm: status change wait_for_quorum => slave
info 40 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node1)
info 41 node1/lrm: starting service vm:101
info 41 node1/lrm: service status vm:101 started
-info 60 node1/crm: auto rebalance - migrate vm:102 to node2 (expected change for imbalance from 1.41 to 0.72)
+info 60 node1/crm: auto rebalance - migrate vm:102 to node2 (expected change for imbalance from 1.00 to 0.51)
info 60 node1/crm: got crm command: migrate vm:102 node2
info 60 node1/crm: migrate service 'vm:102' to node 'node2'
info 60 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node2)
@@ -37,7 +37,7 @@ info 61 node1/lrm: service vm:102 - end migrate to node 'node2'
info 80 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node2)
info 83 node2/lrm: starting service vm:102
info 83 node2/lrm: service status vm:102 started
-info 100 node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.72 to 0.27)
+info 100 node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.51 to 0.19)
info 100 node1/crm: got crm command: migrate vm:101 node3
info 100 node1/crm: crm command 'migrate vm:101 node3' - migrate service 'vm:103' to node 'node3' (service 'vm:103' in positive affinity with service 'vm:101')
info 100 node1/crm: migrate service 'vm:101' to node 'node3'
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
index cd87f3a..d454328 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
@@ -38,7 +38,7 @@ info 25 node3/lrm: got lock 'ha_agent_node3_lock'
info 25 node3/lrm: status change wait_for_agent_lock => active
info 25 node3/lrm: starting service vm:104
info 25 node3/lrm: service status vm:104 started
-info 80 node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 1.04 to 0.72)
+info 80 node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.74 to 0.51)
info 80 node1/crm: got crm command: migrate vm:101 node3
info 80 node1/crm: migrate service 'vm:101' to node 'node3'
info 80 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node3)
@@ -47,7 +47,7 @@ info 81 node1/lrm: service vm:101 - end migrate to node 'node3'
info 100 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node3)
info 105 node3/lrm: starting service vm:101
info 105 node3/lrm: service status vm:101 started
-info 160 node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 0.72 to 0.33)
+info 160 node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 0.51 to 0.23)
info 160 node1/crm: got crm command: migrate vm:104 node2
info 160 node1/crm: migrate service 'vm:104' to node 'node2'
info 160 node1/crm: service 'vm:104': state changed from 'started' to 'migrate' (node = node3, target = node2)
diff --git a/src/test/test-crs-static-auto-rebalance2/log.expect b/src/test/test-crs-static-auto-rebalance2/log.expect
index 6a2ab89..e6d7f7b 100644
--- a/src/test/test-crs-static-auto-rebalance2/log.expect
+++ b/src/test/test-crs-static-auto-rebalance2/log.expect
@@ -34,7 +34,7 @@ info 21 node1/lrm: starting service vm:104
info 21 node1/lrm: service status vm:104 started
info 22 node2/crm: status change wait_for_quorum => slave
info 24 node3/crm: status change wait_for_quorum => slave
-info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info 80 node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
info 80 node1/crm: got crm command: migrate vm:101 node2
info 80 node1/crm: migrate service 'vm:101' to node 'node2'
info 80 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
@@ -45,7 +45,7 @@ info 83 node2/lrm: status change wait_for_agent_lock => active
info 100 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
info 103 node2/lrm: starting service vm:101
info 103 node2/lrm: service status vm:101 started
-info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info 160 node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
info 160 node1/crm: got crm command: migrate vm:102 node3
info 160 node1/crm: migrate service 'vm:102' to node 'node3'
info 160 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
diff --git a/src/test/test-crs-static-auto-rebalance3/log.expect b/src/test/test-crs-static-auto-rebalance3/log.expect
index ecf2d18..d3a8080 100644
--- a/src/test/test-crs-static-auto-rebalance3/log.expect
+++ b/src/test/test-crs-static-auto-rebalance3/log.expect
@@ -53,7 +53,7 @@ info 25 node3/lrm: service status vm:107 started
info 120 cmdlist: execute service vm:105 set-static-stats maxcpu 8.0 maxmem 8192
info 120 cmdlist: execute service vm:106 set-static-stats maxcpu 8.0 maxmem 8192
info 120 cmdlist: execute service vm:107 set-static-stats maxcpu 8.0 maxmem 8192
-info 160 node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.88 to 0.47)
+info 160 node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.62 to 0.33)
info 160 node1/crm: got crm command: migrate vm:105 node1
info 160 node1/crm: migrate service 'vm:105' to node 'node1'
info 160 node1/crm: service 'vm:105': state changed from 'started' to 'migrate' (node = node3, target = node1)
@@ -67,7 +67,7 @@ info 220 cmdlist: execute service vm:102 set-static-stats maxcpu 1.0 max
info 220 cmdlist: execute service vm:103 set-static-stats maxcpu 1.0 maxmem 1024
info 220 cmdlist: execute service vm:104 set-static-stats maxcpu 1.0 maxmem 1024
info 220 cmdlist: execute service vm:105 set-static-stats maxcpu 1.0 maxmem 1024
-info 240 node1/crm: auto rebalance - migrate vm:106 to node2 (expected change for imbalance from 0.91 to 0.42)
+info 240 node1/crm: auto rebalance - migrate vm:106 to node2 (expected change for imbalance from 0.64 to 0.30)
info 240 node1/crm: got crm command: migrate vm:106 node2
info 240 node1/crm: migrate service 'vm:106' to node 'node2'
info 240 node1/crm: service 'vm:106': state changed from 'started' to 'migrate' (node = node3, target = node2)
@@ -76,22 +76,4 @@ info 245 node3/lrm: service vm:106 - end migrate to node 'node2'
info 260 node1/crm: service 'vm:106': state changed from 'migrate' to 'started' (node = node2)
info 263 node2/lrm: starting service vm:106
info 263 node2/lrm: service status vm:106 started
-info 320 node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.42 to 0.31)
-info 320 node1/crm: got crm command: migrate vm:103 node1
-info 320 node1/crm: migrate service 'vm:103' to node 'node1'
-info 320 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node2, target = node1)
-info 323 node2/lrm: service vm:103 - start migrate to node 'node1'
-info 323 node2/lrm: service vm:103 - end migrate to node 'node1'
-info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node1)
-info 341 node1/lrm: starting service vm:103
-info 341 node1/lrm: service status vm:103 started
-info 400 node1/crm: auto rebalance - migrate vm:104 to node1 (expected change for imbalance from 0.31 to 0.20)
-info 400 node1/crm: got crm command: migrate vm:104 node1
-info 400 node1/crm: migrate service 'vm:104' to node 'node1'
-info 400 node1/crm: service 'vm:104': state changed from 'started' to 'migrate' (node = node2, target = node1)
-info 403 node2/lrm: service vm:104 - start migrate to node 'node1'
-info 403 node2/lrm: service vm:104 - end migrate to node 'node1'
-info 420 node1/crm: service 'vm:104': state changed from 'migrate' to 'started' (node = node1)
-info 421 node1/lrm: starting service vm:104
-info 421 node1/lrm: service status vm:104 started
info 820 hardware: exit simulation - done
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH pve-ha-manager 5/7] manager: add load imbalance to status
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
` (3 preceding siblings ...)
2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 9:20 ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
` (2 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
src/PVE/HA/Manager.pm | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bb..ba26fbf 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -285,6 +285,7 @@ sub flush_master_status {
$ms->{node_status} = $ns->{status};
$ms->{service_status} = $ss;
$ms->{timestamp} = $haenv->get_time();
+ $ms->{imbalance} = $self->{online_node_usage}->calculate_node_imbalance();
$haenv->write_manager_status($ms);
}
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH pve-ha-manager 6/7] api: status: add load imbalance to status
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
` (4 preceding siblings ...)
2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 9:10 ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
2026-04-28 9:21 ` [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Daniel Kral
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
This is a very basic measure to enable users to detect the prevailing
load imbalance in the UI, which currently reveals nothing about it.
In my opinion, enabling users to track how the load imbalance changes
over time (using RRD graphs, for example) should be considered in the
long run.
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
src/PVE/API2/HA/Status.pm | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
index 4894f3b..acec78e 100644
--- a/src/PVE/API2/HA/Status.pm
+++ b/src/PVE/API2/HA/Status.pm
@@ -199,7 +199,9 @@ __PACKAGE__->register_method({
}
my $datacenter_config = eval { cfs_read_file('datacenter.cfg') } // {};
if (my $crs = $datacenter_config->{crs}) {
- $extra_status .= " - $crs->{ha} load CRS"
+ $extra_status .=
+ " - $crs->{ha} load CRS "
+ . sprintf("(load imbalance: %.2f", 100 * $status->{imbalance}) . "%)"
if $crs->{ha} && $crs->{ha} ne 'basic';
}
my $time_str = localtime($status->{timestamp});
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
` (5 preceding siblings ...)
2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
2026-04-28 8:53 ` Daniel Kral
2026-04-28 9:21 ` [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Daniel Kral
7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
src/PVE/DataCenterConfig.pm | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/PVE/DataCenterConfig.pm b/src/PVE/DataCenterConfig.pm
index 6513594..d120017 100644
--- a/src/PVE/DataCenterConfig.pm
+++ b/src/PVE/DataCenterConfig.pm
@@ -44,6 +44,7 @@ EODESC
type => 'number',
optional => 1,
minimum => 0.0,
+ maximum => 1.0,
default => 0.3,
requires => 'ha-auto-rebalance',
description => "The threshold for the cluster node imbalance, which will"
@@ -72,6 +73,7 @@ EODESC
type => 'number',
optional => 1,
minimum => 0.0,
+ maximum => 1.0,
default => 0.1,
requires => 'ha-auto-rebalance',
description => "The minimum relative improvement in cluster node"
--
2.47.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH pve-manager 3/7] ui: from/CRSOptions: add maximum for threshold
2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: from/CRSOptions: add maximum for threshold Dominik Rusovac
@ 2026-04-28 8:52 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 8:52 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> www/manager6/form/CRSOptions.js | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/www/manager6/form/CRSOptions.js b/www/manager6/form/CRSOptions.js
> index b5476bd5..985eb8cf 100644
> --- a/www/manager6/form/CRSOptions.js
> +++ b/www/manager6/form/CRSOptions.js
> @@ -66,6 +66,7 @@ Ext.define('PVE.form.CRSOptions', {
> fieldLabel: gettext('Imbalance Threshold'),
> emptyText: '0.3',
> minValue: 0.0,
> + maxValue: 1.0,
> step: 0.01,
> bind: {
> disabled: '{!enableAutoRebalance.checked}',
Nice!
Could be irritating if users have already set this option to a value
greater than 1.0, but as it's a very new feature and an undocumented
setting, this shouldn't affect many users.
Consider this as:
Reviewed-by: Daniel Kral <d.kral@proxmox.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values
2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
@ 2026-04-28 8:52 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 8:52 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
This patch's notes should explain why these values change and why this
causes some of the test cases to perform fewer balancing migrations;
the reason should also be included in the patch summary (subject).
AFAICT it's already nice to see here that the selected migrations are
the same, but because of the default imbalance threshold some of the
previously done balancing migrations are cut.
Might be a discussion point whether to lower the default imbalance
threshold to roughly the mapped value, or whether it is still a good
default for most systems, but that needs more evaluation and is beyond
the scope of this patch.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options
2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
@ 2026-04-28 8:53 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 8:53 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> src/PVE/DataCenterConfig.pm | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/src/PVE/DataCenterConfig.pm b/src/PVE/DataCenterConfig.pm
> index 6513594..d120017 100644
> --- a/src/PVE/DataCenterConfig.pm
> +++ b/src/PVE/DataCenterConfig.pm
> @@ -44,6 +44,7 @@ EODESC
> type => 'number',
> optional => 1,
> minimum => 0.0,
> + maximum => 1.0,
> default => 0.3,
> requires => 'ha-auto-rebalance',
> description => "The threshold for the cluster node imbalance, which will"
> @@ -72,6 +73,7 @@ EODESC
> type => 'number',
> optional => 1,
> minimum => 0.0,
> + maximum => 1.0,
Oh right, this should have already been there before as reducing the
imbalance by more than 100 % makes no sense ;-).
> default => 0.1,
> requires => 'ha-auto-rebalance',
> description => "The minimum relative improvement in cluster node"
nit: it would be nice to have some patch message here too explaining
why these changes are fine, to make them a little more explicit;
AFAICT the new maxima sync with what is already done in the web
interface.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values
2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
@ 2026-04-28 8:53 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 8:53 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> proxmox-resource-scheduling/tests/scheduler.rs | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
This patch should be squashed into the previous one to not break the
build and also make it a little easier to follow why the imbalance
values have changed.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value to unit interval
2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
@ 2026-04-28 9:05 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 9:05 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> The currently used load imbalance value is given as the so-called
> coefficient of variation (CV), a value that may exceed 1. As such, the
> CV value alone lacks meaning. A CV value of 0.0 means no imbalance, but
> what does a value of, say, 1.7 mean?
>
> Relative to the number of nodes in a cluster, it is possible to
> determine the upper bound of the CV value [0][1]. By dividing the CV
> value by its upper bound, the load imbalance can be represented as a
> value that varies between 0 and 1. Expressing the CV as a percentage
> makes the concept of load imbalance easier to interpret.
Nice, thanks for the work!
Will test the changes over the week, but just from the better
readability / interpretability of the imbalance value this should make
it more user-friendly overall.
>
> [0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
> [1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set
and a good read overall, thanks!
A note above Example 1 and the proposition 13 from the first paper [0]
is interesting here:
All these properties refer to the case where a single sample is
considered, however, as noted by [16], Dodd’s corrected coefficient
of variation, CV corr is not suitable for comparative purpose, as
can be seen in the next example.
Example 1. [...]
[...]
Proposition 13. Dodd’s corrected coefficient of variation, CV corr,
is sample-size sensitive.
AFAICT comparing these values to make decisions should not be a problem
for us, since we only compare values for the same cluster-size
configuration, so this shouldn't affect us badly.
On another note, it should also be easier for users to set an imbalance
threshold value that is (at least roughly) invariant to the size of
the cluster. If one or more nodes have failed or are in maintenance
mode, this dramatically changes the decisions that the load balancer
can make anyway, but I wonder how much difference it makes to the
sensitivity of triggering the load balancing system.
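To get a feel for this, here is a small standalone sketch (the function
name is made up for illustration, not the actual
proxmox-resource-scheduling API) showing that the clamped value behaves
the same way across cluster sizes:

```rust
/// Clamped load imbalance: coefficient of variation divided by its
/// upper bound sqrt(n - 1) for n nodes. Hypothetical helper, modeled
/// on the patch, not the crate's real function.
fn clamped_imbalance(loads: &[f64]) -> f64 {
    let n = loads.len() as f64;
    if loads.len() < 2 {
        return 0.0; // fewer than 2 nodes: nothing to balance
    }
    let sum: f64 = loads.iter().sum();
    if sum == 0.0 {
        return 0.0; // all nodes idle; also avoids dividing by a zero mean
    }
    let mean = sum / n;
    let variance = loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
    let cv = variance.sqrt() / mean;
    cv / (n - 1.0).sqrt() // normalize by the upper bound of the CV
}

fn main() {
    // One fully loaded node among otherwise idle nodes maps to 1.00,
    // regardless of cluster size:
    println!("{:.2}", clamped_imbalance(&[1.0, 0.0, 0.0])); // 1.00
    println!("{:.2}", clamped_imbalance(&[1.0, 0.0, 0.0, 0.0, 0.0])); // 1.00
    // Perfectly balanced loads map to 0.00:
    println!("{:.2}", clamped_imbalance(&[0.5, 0.5, 0.5])); // 0.00
}
```

So the worst case reads the same no matter how many nodes there are,
which is what makes a fixed threshold roughly size-invariant.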
>
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++-------
> 1 file changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/proxmox-resource-scheduling/src/scheduler.rs b/proxmox-resource-scheduling/src/scheduler.rs
> index 49d16f9f..4eacbff9 100644
> --- a/proxmox-resource-scheduling/src/scheduler.rs
> +++ b/proxmox-resource-scheduling/src/scheduler.rs
> @@ -17,17 +17,23 @@ pub struct NodeUsage {
> pub stats: NodeStats,
> }
>
> -/// Returns the load imbalance among the nodes.
> +/// Returns the load imbalance among the nodes, which is a value between 0 and 1 that describes the
> +/// statistical dispersion of the individual node loads around the mean node load. The lower the
> +/// value, the better.
> ///
> -/// The load balance is measured as the statistical dispersion of the individual node loads.
> -///
> -/// The current implementation uses the dimensionless coefficient of variation, which expresses the
> -/// standard deviation in relation to the average mean of the node loads.
> -///
> -/// The coefficient of variation is not robust, which is a desired property here, because outliers
> -/// should be detected as much as possible.
> +/// In more detail, the current implementation computes the so-called coefficient of variation (CV),
> +/// which is the ratio of the standard deviation to the mean of the given node loads. The lower
> +/// bound of the CV is reached if all node loads are equal. The upper bound is reached if all nodes
> +/// except one are idle. To present the CV as a value between 0 and 1, it's being divided by the
> +/// upper bound of the CV for the given number of nodes.
> fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) -> f64) -> f64 {
> - let node_count = nodes.len();
> + let node_count = nodes.len() as f64;
even though this reduces the number of 'as f64' casts below, the node
count is by its nature a positive integer, so it should stay that way.
> +
> + // imbalance is perfect for less than 2 nodes
> + if node_count < 2.0 {
> + return 0.0;
> + }
this could replace the check `load_sum == 0.0` below, which also makes
sure that we never divide by zero (and return NaN as a result); the
assignments for node_loads and load_sum could then move into the else
branch's code path.
A comment could make it more explicit that this avoids dividing by zero.
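The divide-by-zero concern is worth spelling out: in IEEE 754
arithmetic an all-idle cluster would not crash but silently yield NaN,
which then also fails every threshold comparison. A minimal
demonstration:

```rust
fn main() {
    // Without a guard, an all-idle cluster gives load_mean = 0.0, and
    // 0.0 / 0.0 evaluates to NaN rather than raising an error:
    let load_sd = 0.0_f64;
    let load_mean = 0.0_f64;
    let imbalance = load_sd / load_mean;
    assert!(imbalance.is_nan());

    // NaN compares false against everything, so a NaN imbalance would
    // never trip an `imbalance > threshold` check and the bad state
    // would go unnoticed:
    assert!(!(imbalance > 0.3));
    assert!(!(imbalance <= 0.3));
    println!("all-idle imbalance without guard: {imbalance}");
}
```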
> +
> let node_loads = nodes.iter().map(to_load).collect::<Vec<_>>();
>
> let load_sum = node_loads.iter().sum::<f64>();
> @@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) ->
> if load_sum == 0.0 {
> 0.0
> } else {
> - let load_mean = load_sum / node_count as f64;
> + let load_mean = load_sum / node_count;
>
> let squared_diff_sum = node_loads
> .iter()
> .fold(0.0, |sum, node_load| sum + (node_load - load_mean).powi(2));
> - let load_sd = (squared_diff_sum / node_count as f64).sqrt();
> + let load_sd = (squared_diff_sum / node_count).sqrt();
> +
> + let max_cv = (node_count - 1.0).sqrt();
> + let cv = load_sd / load_mean;
nit: just for aesthetics, this could be reordered to cv and then max_cv.
Also, to not lose the reference from the patch message in future
changes, it would be nice to add a comment here explaining why the
calculation for max_cv is correct.
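For reference, the bound can be derived directly (a sketch following
the linked references, not the exact comment wording the patch should
use): for n nonnegative loads, the CV is maximal when one node carries
the whole load x > 0 and the remaining n - 1 nodes are idle:

```latex
\begin{align*}
\mu      &= \frac{x}{n},\\
\sigma^2 &= \frac{1}{n}\left[\left(x - \frac{x}{n}\right)^2
            + (n-1)\left(\frac{x}{n}\right)^2\right]
          = \frac{x^2\,(n-1)}{n^2},\\
\mathrm{CV}_{\max} &= \frac{\sigma}{\mu}
          = \frac{x\sqrt{n-1}/n}{x/n}
          = \sqrt{n-1}.
\end{align*}
```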
>
> - load_sd / load_mean
> + cv / max_cv
> }
> }
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-ha-manager 6/7] api: status: add load imbalance to status
2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
@ 2026-04-28 9:10 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 9:10 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> This is a very basic measure to enable users to detect the prevailing
> load imbalance in the UI, which currently reveals nothing about it.
>
> In my opinion, enabling users to track how the load imbalance changes
> over time (using RRD graphs, for example) should be considered in the
> long run.
>
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> src/PVE/API2/HA/Status.pm | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
> index 4894f3b..acec78e 100644
> --- a/src/PVE/API2/HA/Status.pm
> +++ b/src/PVE/API2/HA/Status.pm
> @@ -199,7 +199,9 @@ __PACKAGE__->register_method({
> }
> my $datacenter_config = eval { cfs_read_file('datacenter.cfg') } // {};
> if (my $crs = $datacenter_config->{crs}) {
> - $extra_status .= " - $crs->{ha} load CRS"
> + $extra_status .=
> + " - $crs->{ha} load CRS "
> + . sprintf("(load imbalance: %.2f", 100 * $status->{imbalance}) . "%)"
> if $crs->{ha} && $crs->{ha} ne 'basic';
I think this should also check whether the load balancing system is
enabled, so the status string isn't cluttered when the balancer is
unused, or when no action is taken even if this value is high, but no
hard feelings.
> }
> my $time_str = localtime($status->{timestamp});
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-ha-manager 5/7] manager: add load imbalance to status
2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
@ 2026-04-28 9:20 ` Daniel Kral
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 9:20 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
> src/PVE/HA/Manager.pm | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index b69a6bb..ba26fbf 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -285,6 +285,7 @@ sub flush_master_status {
> $ms->{node_status} = $ns->{status};
> $ms->{service_status} = $ss;
> $ms->{timestamp} = $haenv->get_time();
> + $ms->{imbalance} = $self->{online_node_usage}->calculate_node_imbalance();
Nice! This should allow a better look at the current state of the load
balancer and how well it performs w.r.t. the imbalance threshold.
>
> $haenv->write_manager_status($ms);
> }
Consider as:
Reviewed-by: Daniel Kral <d.kral@proxmox.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
` (6 preceding siblings ...)
2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
@ 2026-04-28 9:21 ` Daniel Kral
7 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28 9:21 UTC (permalink / raw)
To: Dominik Rusovac, pve-devel
On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> # TL;DR
> clamp load imbalance to value between 0 and 1, and display the value as
> percentage in HA Status panel of PVE UI.
>
> # Details
> The currently used load imbalance value is given as the so-called coefficient of
> variation (CV), a value that may exceed 1. As such, the CV value alone lacks
> meaning. A CV value of 0.0 means no imbalance, but what does a value of, say,
> 1.7 mean?
>
> Relative to the number of nodes in a cluster, it is possible to determine the
> upper bound of the CV value [0][1]. By dividing the CV value by its upper
> bound, the load imbalance can be represented as a value that varies between 0
> and 1. Expressing the CV as a percentage makes the concept of load imbalance
> easier to interpret.
>
> # Summary of Changes
> This series:
> - represents load imbalance as a value between 0 and 1;
> - adds a maximum value of 1.0 for load scheduler options; and
> - integrates the load imbalance value within the HA status endpoint;
> this is to provide feedback on the prevailing load imbalance in the PVE UI.
As discussed off-list, it would be interesting to also keep a history of
the imbalance value for the cluster.
In that discussion we also wondered whether we could derive that history
without changing the rrdcached schema at all, by fetching the
average/maximum values for the already pre-defined time frames (each
minute, each hour, etc.) and using the same calculate_node_imbalance(),
just on the raw values.
Haven't checked how much error this introduces, since the rrdcached
values differ from the sampled values fetched from the rrddump in the
HA Manager simply because they are averaged out, but it would be
interesting to see whether the introduced error is negligible.
^ permalink raw reply [flat|nested] 16+ messages in thread