public inbox for pve-devel@lists.proxmox.com
* [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval
@ 2026-04-27 13:20 Dominik Rusovac
  2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

# TL;DR
Clamp the load imbalance to a value between 0 and 1, and display it as a
percentage in the HA Status panel of the PVE UI.

# Details
The load imbalance is currently reported as the so-called coefficient of
variation (CV), a value that may exceed 1 and is therefore hard to
interpret on its own. A CV of 0.0 means no imbalance, but what does a
value of, say, 1.7 mean?

For a given number of nodes in a cluster, the upper bound of the CV can
be determined [0][1]. Dividing the CV by this upper bound yields a load
imbalance value that always lies between 0 and 1; expressing that
normalized value as a percentage makes the load imbalance easier to
interpret.
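For illustration, the normalization can be sketched as follows. This is
a minimal standalone sketch; the function name and signature are made up
for this example and do not reflect the actual
proxmox-resource-scheduling API.

```rust
// Illustrative sketch only; `normalized_imbalance` is a hypothetical name,
// not the actual function in proxmox-resource-scheduling.
fn normalized_imbalance(node_loads: &[f64]) -> f64 {
    let n = node_loads.len() as f64;
    if n < 2.0 {
        return 0.0; // a single node (or none) is trivially balanced
    }
    let mean = node_loads.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return 0.0; // no load at all means no imbalance
    }
    // population standard deviation of the node loads
    let variance = node_loads
        .iter()
        .map(|load| (load - mean).powi(2))
        .sum::<f64>()
        / n;
    let cv = variance.sqrt() / mean; // coefficient of variation
    let max_cv = (n - 1.0).sqrt(); // upper bound: all load on a single node
    cv / max_cv // in [0, 1]
}

fn main() {
    // all load on one of three nodes: maximal imbalance of 1.0
    assert!((normalized_imbalance(&[9.0, 0.0, 0.0]) - 1.0).abs() < 1e-12);
    // equal loads: no imbalance
    assert_eq!(normalized_imbalance(&[3.0, 3.0, 3.0]), 0.0);
}
```

The bound sqrt(n - 1) is reached exactly when all load sits on one node,
so the worst possible distribution maps to 1.0 regardless of cluster size.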

# Summary of Changes
This series:
- represents load imbalance as a value between 0 and 1;
- adds a maximum value of 1.0 for the load scheduler options; and
- integrates the load imbalance value into the HA status endpoint,
  to provide feedback on the prevailing load imbalance in the PVE UI.

# Refs
[0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
[1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set


proxmox:

Dominik Rusovac (2):
  resource-scheduling: clamp imbalance value to unit interval
  resource-scheduling: re-adjust hardcoded imbalance values

 proxmox-resource-scheduling/src/scheduler.rs  | 33 ++++++++++++-------
 .../tests/scheduler.rs                        |  8 ++---
 2 files changed, 25 insertions(+), 16 deletions(-)


pve-manager:

Dominik Rusovac (1):
  ui: form/CRSOptions: add maximum for threshold

 www/manager6/form/CRSOptions.js | 1 +
 1 file changed, 1 insertion(+)


pve-ha-manager:

Dominik Rusovac (3):
  test: re-adjust logged imbalance values
  manager: add load imbalance to status
  api: status: add load imbalance to status

 src/PVE/API2/HA/Status.pm                     |  4 +-
 src/PVE/HA/Manager.pm                         |  1 +
 .../log.expect                                |  4 +-
 .../log.expect                                | 38 +++++++++----------
 .../log.expect                                |  4 +-
 .../log.expect                                | 29 +++++---------
 .../log.expect                                |  2 +-
 .../log.expect                                |  2 +-
 .../log.expect                                |  4 +-
 .../log.expect                                |  4 +-
 .../log.expect                                |  4 +-
 .../log.expect                                | 22 +----------
 12 files changed, 47 insertions(+), 71 deletions(-)


pve-cluster:

Dominik Rusovac (1):
  datacenter config: add maxima for load scheduler options

 src/PVE/DataCenterConfig.pm | 2 ++
 1 file changed, 2 insertions(+)


Summary over all repositories:
  16 files changed, 75 insertions(+), 87 deletions(-)

-- 
Generated by murpp 0.11.0




^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value to unit interval
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  9:05   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

The load imbalance is currently reported as the so-called coefficient
of variation (CV), a value that may exceed 1 and is therefore hard to
interpret on its own. A CV of 0.0 means no imbalance, but what does a
value of, say, 1.7 mean?

For a given number of nodes in a cluster, the upper bound of the CV
can be determined [0][1]. Dividing the CV by this upper bound yields a
load imbalance value that always lies between 0 and 1; expressing that
normalized value as a percentage makes the load imbalance easier to
interpret.

[0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
[1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set
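As a numeric sanity check (an editorial illustration, not part of the
patch): for a 3-node cluster with all load on a single node, the raw CV
equals sqrt(3 - 1), roughly 1.41, the value visible in the pre-patch
simulator logs, and normalizing by that bound yields exactly 1.00:

```rust
// Editorial illustration of the upper bound from [0][1]: with all load on a
// single node out of n, the coefficient of variation equals sqrt(n - 1).
fn cv(loads: &[f64]) -> f64 {
    let n = loads.len() as f64;
    let mean = loads.iter().sum::<f64>() / n;
    let variance = loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn main() {
    // 3-node cluster, all load on one node: CV = sqrt(2), about 1.41,
    // matching the pre-patch simulator logs
    let raw = cv(&[6.0, 0.0, 0.0]);
    assert!((raw - 2.0_f64.sqrt()).abs() < 1e-12);
    // dividing by the bound sqrt(n - 1) clamps the value to 1.00,
    // matching the post-patch logs
    assert!((raw / (3.0_f64 - 1.0).sqrt() - 1.0).abs() < 1e-12);
}
```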

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++-------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/proxmox-resource-scheduling/src/scheduler.rs b/proxmox-resource-scheduling/src/scheduler.rs
index 49d16f9f..4eacbff9 100644
--- a/proxmox-resource-scheduling/src/scheduler.rs
+++ b/proxmox-resource-scheduling/src/scheduler.rs
@@ -17,17 +17,23 @@ pub struct NodeUsage {
     pub stats: NodeStats,
 }
 
-/// Returns the load imbalance among the nodes.
+/// Returns the load imbalance among the nodes, which is a value between 0 and 1 that describes the
+/// statistical dispersion of the individual node loads around the mean node load. The lower the
+/// value, the better.
 ///
-/// The load balance is measured as the statistical dispersion of the individual node loads.
-///
-/// The current implementation uses the dimensionless coefficient of variation, which expresses the
-/// standard deviation in relation to the average mean of the node loads.
-///
-/// The coefficient of variation is not robust, which is a desired property here, because outliers
-/// should be detected as much as possible.
+/// In more detail, the current implementation computes the so-called coefficient of variation (CV),
+/// which is the ratio of the standard deviation to the mean of the given node loads. The lower
+/// bound of the CV is reached if all node loads are equal. The upper bound is reached if all nodes
+/// except one are idle. To present the CV as a value between 0 and 1, it is divided by the
+/// upper bound of the CV for the given number of nodes.
 fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) -> f64) -> f64 {
-    let node_count = nodes.len();
+    let node_count = nodes.len() as f64;
+
+    // fewer than 2 nodes are trivially balanced, so there is no imbalance
+    if node_count < 2.0 {
+        return 0.0;
+    }
+
     let node_loads = nodes.iter().map(to_load).collect::<Vec<_>>();
 
     let load_sum = node_loads.iter().sum::<f64>();
@@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) ->
     if load_sum == 0.0 {
         0.0
     } else {
-        let load_mean = load_sum / node_count as f64;
+        let load_mean = load_sum / node_count;
 
         let squared_diff_sum = node_loads
             .iter()
             .fold(0.0, |sum, node_load| sum + (node_load - load_mean).powi(2));
-        let load_sd = (squared_diff_sum / node_count as f64).sqrt();
+        let load_sd = (squared_diff_sum / node_count).sqrt();
+
+        let max_cv = (node_count - 1.0).sqrt();
+        let cv = load_sd / load_mean;
 
-        load_sd / load_mean
+        cv / max_cv
     }
 }
 
-- 
2.47.3






* [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
  2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  8:53   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 proxmox-resource-scheduling/tests/scheduler.rs | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/proxmox-resource-scheduling/tests/scheduler.rs b/proxmox-resource-scheduling/tests/scheduler.rs
index be90e4f9..21dbe451 100644
--- a/proxmox-resource-scheduling/tests/scheduler.rs
+++ b/proxmox-resource-scheduling/tests/scheduler.rs
@@ -172,7 +172,7 @@ fn test_score_best_balancing_migration_candidates_with_no_candidates() {
 fn test_score_best_balancing_migration_candidates_in_homogeneous_cluster() {
     let scheduler = new_homogeneous_cluster_scheduler();
 
-    assert_imbalance(scheduler.node_imbalance(), 0.4893954724628247);
+    assert_imbalance(scheduler.node_imbalance(), 0.3460548572604576);
 
     let (candidates, migration1, migration2) = new_simple_migration_candidates();
 
@@ -186,7 +186,7 @@ fn test_score_best_balancing_migration_candidates_in_homogeneous_cluster() {
 fn test_score_best_balancing_migration_candidates_in_heterogeneous_cluster() {
     let scheduler = new_heterogeneous_cluster_scheduler();
 
-    assert_imbalance(scheduler.node_imbalance(), 0.33026013056867354);
+    assert_imbalance(scheduler.node_imbalance(), 0.23352917788066363);
 
     let (candidates, migration1, migration2) = new_simple_migration_candidates();
 
@@ -225,7 +225,7 @@ fn test_score_best_balancing_migration_candidates_topsis_in_homogeneous_cluster(
 ) -> Result<(), Error> {
     let scheduler = new_homogeneous_cluster_scheduler();
 
-    assert_imbalance(scheduler.node_imbalance(), 0.4893954724628247);
+    assert_imbalance(scheduler.node_imbalance(), 0.3460548572604576);
 
     let (candidates, migration1, migration2) = new_simple_migration_candidates();
 
@@ -242,7 +242,7 @@ fn test_score_best_balancing_migration_candidates_topsis_in_heterogeneous_cluste
 ) -> Result<(), Error> {
     let scheduler = new_heterogeneous_cluster_scheduler();
 
-    assert_imbalance(scheduler.node_imbalance(), 0.33026013056867354);
+    assert_imbalance(scheduler.node_imbalance(), 0.23352917788066363);
 
     let (candidates, migration1, migration2) = new_simple_migration_candidates();
 
-- 
2.47.3






* [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
  2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
  2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  8:52   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 www/manager6/form/CRSOptions.js | 1 +
 1 file changed, 1 insertion(+)

diff --git a/www/manager6/form/CRSOptions.js b/www/manager6/form/CRSOptions.js
index b5476bd5..985eb8cf 100644
--- a/www/manager6/form/CRSOptions.js
+++ b/www/manager6/form/CRSOptions.js
@@ -66,6 +66,7 @@ Ext.define('PVE.form.CRSOptions', {
                     fieldLabel: gettext('Imbalance Threshold'),
                     emptyText: '0.3',
                     minValue: 0.0,
+                    maxValue: 1.0,
                     step: 0.01,
                     bind: {
                         disabled: '{!enableAutoRebalance.checked}',
-- 
2.47.3






* [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
                   ` (2 preceding siblings ...)
  2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  8:52   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 .../log.expect                                |  4 +-
 .../log.expect                                | 38 +++++++++----------
 .../log.expect                                |  4 +-
 .../log.expect                                | 29 +++++---------
 .../log.expect                                |  2 +-
 .../log.expect                                |  2 +-
 .../log.expect                                |  4 +-
 .../log.expect                                |  4 +-
 .../log.expect                                |  4 +-
 .../log.expect                                | 22 +----------
 10 files changed, 43 insertions(+), 70 deletions(-)

diff --git a/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect b/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
index 3d79026..83d4e60 100644
--- a/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance-topsis2/log.expect
@@ -34,7 +34,7 @@ info     21    node1/lrm: starting service vm:104
 info     21    node1/lrm: service status vm:104 started
 info     22    node2/crm: status change wait_for_quorum => slave
 info     24    node3/crm: status change wait_for_quorum => slave
-info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
 info     80    node1/crm: got crm command: migrate vm:101 node2
 info     80    node1/crm: migrate service 'vm:101' to node 'node2'
 info     80    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node2)
@@ -45,7 +45,7 @@ info     83    node2/lrm: status change wait_for_agent_lock => active
 info    100    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node2)
 info    103    node2/lrm: starting service vm:101
 info    103    node2/lrm: service status vm:101 started
-info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
 info    160    node1/crm: got crm command: migrate vm:102 node3
 info    160    node1/crm: migrate service 'vm:102' to node 'node3'
 info    160    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node3)
diff --git a/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect b/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
index c9fc29e..c539122 100644
--- a/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance-topsis3/log.expect
@@ -53,7 +53,7 @@ info     25    node3/lrm: service status vm:107 started
 info    120      cmdlist: execute service vm:105 set-dynamic-stats cpu 7.8 mem 7912
 info    120      cmdlist: execute service vm:106 set-dynamic-stats cpu 5.7 mem 8192
 info    120      cmdlist: execute service vm:107 set-dynamic-stats cpu 6.0 mem 8011
-info    160    node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.85 to 0.42)
+info    160    node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.60 to 0.30)
 info    160    node1/crm: got crm command: migrate vm:105 node2
 info    160    node1/crm: migrate service 'vm:105' to node 'node2'
 info    160    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node3, target = node2)
@@ -68,22 +68,22 @@ info    220      cmdlist: execute service vm:104 set-dynamic-stats cpu 6.7 mem 8
 info    220      cmdlist: execute service vm:105 set-dynamic-stats cpu 1.8 mem 1201
 info    220      cmdlist: execute service vm:106 set-dynamic-stats cpu 2.1 mem 1211
 info    220      cmdlist: execute service vm:107 set-dynamic-stats cpu 0.9 mem 1191
-info    240    node1/crm: auto rebalance - migrate vm:103 to node3 (expected change for imbalance from 0.81 to 0.43)
-info    240    node1/crm: got crm command: migrate vm:103 node3
-info    240    node1/crm: migrate service 'vm:103' to node 'node3'
-info    240    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node3)
-info    243    node2/lrm: service vm:103 - start migrate to node 'node3'
-info    243    node2/lrm: service vm:103 - end migrate to node 'node3'
-info    260    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
-info    265    node3/lrm: starting service vm:103
-info    265    node3/lrm: service status vm:103 started
-info    320    node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.43 to 0.24)
-info    320    node1/crm: got crm command: migrate vm:105 node1
-info    320    node1/crm: migrate service 'vm:105' to node 'node1'
-info    320    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node2, target = node1)
-info    323    node2/lrm: service vm:105 - start migrate to node 'node1'
-info    323    node2/lrm: service vm:105 - end migrate to node 'node1'
-info    340    node1/crm: service 'vm:105': state changed from 'migrate' to 'started'  (node = node1)
-info    341    node1/lrm: starting service vm:105
-info    341    node1/lrm: service status vm:105 started
+info    260    node1/crm: auto rebalance - migrate vm:103 to node3 (expected change for imbalance from 0.57 to 0.30)
+info    260    node1/crm: got crm command: migrate vm:103 node3
+info    260    node1/crm: migrate service 'vm:103' to node 'node3'
+info    260    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node3)
+info    263    node2/lrm: service vm:103 - start migrate to node 'node3'
+info    263    node2/lrm: service vm:103 - end migrate to node 'node3'
+info    280    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
+info    285    node3/lrm: starting service vm:103
+info    285    node3/lrm: service status vm:103 started
+info    340    node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.30 to 0.17)
+info    340    node1/crm: got crm command: migrate vm:105 node1
+info    340    node1/crm: migrate service 'vm:105' to node 'node1'
+info    340    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node2, target = node1)
+info    343    node2/lrm: service vm:105 - start migrate to node 'node1'
+info    343    node2/lrm: service vm:105 - end migrate to node 'node1'
+info    360    node1/crm: service 'vm:105': state changed from 'migrate' to 'started'  (node = node1)
+info    361    node1/lrm: starting service vm:105
+info    361    node1/lrm: service status vm:105 started
 info    820     hardware: exit simulation - done
diff --git a/src/test/test-crs-dynamic-auto-rebalance2/log.expect b/src/test/test-crs-dynamic-auto-rebalance2/log.expect
index 3d79026..83d4e60 100644
--- a/src/test/test-crs-dynamic-auto-rebalance2/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance2/log.expect
@@ -34,7 +34,7 @@ info     21    node1/lrm: starting service vm:104
 info     21    node1/lrm: service status vm:104 started
 info     22    node2/crm: status change wait_for_quorum => slave
 info     24    node3/crm: status change wait_for_quorum => slave
-info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
 info     80    node1/crm: got crm command: migrate vm:101 node2
 info     80    node1/crm: migrate service 'vm:101' to node 'node2'
 info     80    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node2)
@@ -45,7 +45,7 @@ info     83    node2/lrm: status change wait_for_agent_lock => active
 info    100    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node2)
 info    103    node2/lrm: starting service vm:101
 info    103    node2/lrm: service status vm:101 started
-info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
 info    160    node1/crm: got crm command: migrate vm:102 node3
 info    160    node1/crm: migrate service 'vm:102' to node 'node3'
 info    160    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node3)
diff --git a/src/test/test-crs-dynamic-auto-rebalance3/log.expect b/src/test/test-crs-dynamic-auto-rebalance3/log.expect
index 275f7ae..6f8c1ee 100644
--- a/src/test/test-crs-dynamic-auto-rebalance3/log.expect
+++ b/src/test/test-crs-dynamic-auto-rebalance3/log.expect
@@ -53,7 +53,7 @@ info     25    node3/lrm: service status vm:107 started
 info    120      cmdlist: execute service vm:105 set-dynamic-stats cpu 7.8 mem 7912
 info    120      cmdlist: execute service vm:106 set-dynamic-stats cpu 5.7 mem 8192
 info    120      cmdlist: execute service vm:107 set-dynamic-stats cpu 6.0 mem 8011
-info    160    node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.85 to 0.42)
+info    160    node1/crm: auto rebalance - migrate vm:105 to node2 (expected change for imbalance from 0.60 to 0.30)
 info    160    node1/crm: got crm command: migrate vm:105 node2
 info    160    node1/crm: migrate service 'vm:105' to node 'node2'
 info    160    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node3, target = node2)
@@ -68,22 +68,13 @@ info    220      cmdlist: execute service vm:104 set-dynamic-stats cpu 6.7 mem 8
 info    220      cmdlist: execute service vm:105 set-dynamic-stats cpu 1.8 mem 1201
 info    220      cmdlist: execute service vm:106 set-dynamic-stats cpu 2.1 mem 1211
 info    220      cmdlist: execute service vm:107 set-dynamic-stats cpu 0.9 mem 1191
-info    240    node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.81 to 0.40)
-info    240    node1/crm: got crm command: migrate vm:103 node1
-info    240    node1/crm: migrate service 'vm:103' to node 'node1'
-info    240    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node1)
-info    243    node2/lrm: service vm:103 - start migrate to node 'node1'
-info    243    node2/lrm: service vm:103 - end migrate to node 'node1'
-info    260    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
-info    261    node1/lrm: starting service vm:103
-info    261    node1/lrm: service status vm:103 started
-info    320    node1/crm: auto rebalance - migrate vm:105 to node3 (expected change for imbalance from 0.40 to 0.21)
-info    320    node1/crm: got crm command: migrate vm:105 node3
-info    320    node1/crm: migrate service 'vm:105' to node 'node3'
-info    320    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node2, target = node3)
-info    323    node2/lrm: service vm:105 - start migrate to node 'node3'
-info    323    node2/lrm: service vm:105 - end migrate to node 'node3'
-info    340    node1/crm: service 'vm:105': state changed from 'migrate' to 'started'  (node = node3)
-info    345    node3/lrm: starting service vm:105
-info    345    node3/lrm: service status vm:105 started
+info    260    node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.57 to 0.28)
+info    260    node1/crm: got crm command: migrate vm:103 node1
+info    260    node1/crm: migrate service 'vm:103' to node 'node1'
+info    260    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node1)
+info    263    node2/lrm: service vm:103 - start migrate to node 'node1'
+info    263    node2/lrm: service vm:103 - end migrate to node 'node1'
+info    280    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    281    node1/lrm: starting service vm:103
+info    281    node1/lrm: service status vm:103 started
 info    820     hardware: exit simulation - done
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
index c926799..30d9721 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance1/log.expect
@@ -35,7 +35,7 @@ info    120      cmdlist: execute service vm:104 set-static-stats maxcpu 8.0 max
 info    120      cmdlist: execute service vm:104 set-dynamic-stats cpu 4.0 mem 4096
 info    120    node1/crm: adding new service 'vm:104' on node 'node1'
 info    120    node1/crm: service 'vm:104': state changed from 'request_start' to 'started'  (node = node1)
-info    140    node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 1.41 to 0.98)
+info    140    node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 1.00 to 0.70)
 info    140    node1/crm: got crm command: migrate vm:104 node2
 info    140    node1/crm: migrate service 'vm:104' to node 'node2'
 info    140    node1/crm: service 'vm:104': state changed from 'started' to 'migrate'  (node = node1, target = node2)
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
index 26be942..d9189c9 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance2/log.expect
@@ -31,7 +31,7 @@ info    120      cmdlist: execute service vm:103 set-static-stats maxcpu 8.0 max
 info    120      cmdlist: execute service vm:103 set-dynamic-stats cpu 4.0 mem 4096
 info    120    node1/crm: adding new service 'vm:103' on node 'node1'
 info    120    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node1)
-info    140    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.86)
+info    140    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.61)
 info    140    node1/crm: got crm command: migrate vm:101 node2
 info    140    node1/crm: crm command 'migrate vm:101 node2' - migrate service 'vm:102' to node 'node2' (service 'vm:102' in positive affinity with service 'vm:101')
 info    140    node1/crm: migrate service 'vm:101' to node 'node2'
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
index 35282c7..82b0b13 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance3/log.expect
@@ -28,7 +28,7 @@ info     24    node3/crm: status change wait_for_quorum => slave
 info     40    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node1)
 info     41    node1/lrm: starting service vm:101
 info     41    node1/lrm: service status vm:101 started
-info     60    node1/crm: auto rebalance - migrate vm:102 to node2 (expected change for imbalance from 1.41 to 0.72)
+info     60    node1/crm: auto rebalance - migrate vm:102 to node2 (expected change for imbalance from 1.00 to 0.51)
 info     60    node1/crm: got crm command: migrate vm:102 node2
 info     60    node1/crm: migrate service 'vm:102' to node 'node2'
 info     60    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node2)
@@ -37,7 +37,7 @@ info     61    node1/lrm: service vm:102 - end migrate to node 'node2'
 info     80    node1/crm: service 'vm:102': state changed from 'migrate' to 'started'  (node = node2)
 info     83    node2/lrm: starting service vm:102
 info     83    node2/lrm: service status vm:102 started
-info    100    node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.72 to 0.27)
+info    100    node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.51 to 0.19)
 info    100    node1/crm: got crm command: migrate vm:101 node3
 info    100    node1/crm: crm command 'migrate vm:101 node3' - migrate service 'vm:103' to node 'node3' (service 'vm:103' in positive affinity with service 'vm:101')
 info    100    node1/crm: migrate service 'vm:101' to node 'node3'
diff --git a/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect b/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
index cd87f3a..d454328 100644
--- a/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
+++ b/src/test/test-crs-dynamic-constrained-auto-rebalance4/log.expect
@@ -38,7 +38,7 @@ info     25    node3/lrm: got lock 'ha_agent_node3_lock'
 info     25    node3/lrm: status change wait_for_agent_lock => active
 info     25    node3/lrm: starting service vm:104
 info     25    node3/lrm: service status vm:104 started
-info     80    node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 1.04 to 0.72)
+info     80    node1/crm: auto rebalance - migrate vm:101 to node3 (expected change for imbalance from 0.74 to 0.51)
 info     80    node1/crm: got crm command: migrate vm:101 node3
 info     80    node1/crm: migrate service 'vm:101' to node 'node3'
 info     80    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node3)
@@ -47,7 +47,7 @@ info     81    node1/lrm: service vm:101 - end migrate to node 'node3'
 info    100    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node3)
 info    105    node3/lrm: starting service vm:101
 info    105    node3/lrm: service status vm:101 started
-info    160    node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 0.72 to 0.33)
+info    160    node1/crm: auto rebalance - migrate vm:104 to node2 (expected change for imbalance from 0.51 to 0.23)
 info    160    node1/crm: got crm command: migrate vm:104 node2
 info    160    node1/crm: migrate service 'vm:104' to node 'node2'
 info    160    node1/crm: service 'vm:104': state changed from 'started' to 'migrate'  (node = node3, target = node2)
diff --git a/src/test/test-crs-static-auto-rebalance2/log.expect b/src/test/test-crs-static-auto-rebalance2/log.expect
index 6a2ab89..e6d7f7b 100644
--- a/src/test/test-crs-static-auto-rebalance2/log.expect
+++ b/src/test/test-crs-static-auto-rebalance2/log.expect
@@ -34,7 +34,7 @@ info     21    node1/lrm: starting service vm:104
 info     21    node1/lrm: service status vm:104 started
 info     22    node2/crm: status change wait_for_quorum => slave
 info     24    node3/crm: status change wait_for_quorum => slave
-info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.41 to 0.94)
+info     80    node1/crm: auto rebalance - migrate vm:101 to node2 (expected change for imbalance from 1.00 to 0.66)
 info     80    node1/crm: got crm command: migrate vm:101 node2
 info     80    node1/crm: migrate service 'vm:101' to node 'node2'
 info     80    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node2)
@@ -45,7 +45,7 @@ info     83    node2/lrm: status change wait_for_agent_lock => active
 info    100    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node2)
 info    103    node2/lrm: starting service vm:101
 info    103    node2/lrm: service status vm:101 started
-info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.94 to 0.35)
+info    160    node1/crm: auto rebalance - migrate vm:102 to node3 (expected change for imbalance from 0.66 to 0.25)
 info    160    node1/crm: got crm command: migrate vm:102 node3
 info    160    node1/crm: migrate service 'vm:102' to node 'node3'
 info    160    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node3)
diff --git a/src/test/test-crs-static-auto-rebalance3/log.expect b/src/test/test-crs-static-auto-rebalance3/log.expect
index ecf2d18..d3a8080 100644
--- a/src/test/test-crs-static-auto-rebalance3/log.expect
+++ b/src/test/test-crs-static-auto-rebalance3/log.expect
@@ -53,7 +53,7 @@ info     25    node3/lrm: service status vm:107 started
 info    120      cmdlist: execute service vm:105 set-static-stats maxcpu 8.0 maxmem 8192
 info    120      cmdlist: execute service vm:106 set-static-stats maxcpu 8.0 maxmem 8192
 info    120      cmdlist: execute service vm:107 set-static-stats maxcpu 8.0 maxmem 8192
-info    160    node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.88 to 0.47)
+info    160    node1/crm: auto rebalance - migrate vm:105 to node1 (expected change for imbalance from 0.62 to 0.33)
 info    160    node1/crm: got crm command: migrate vm:105 node1
 info    160    node1/crm: migrate service 'vm:105' to node 'node1'
 info    160    node1/crm: service 'vm:105': state changed from 'started' to 'migrate'  (node = node3, target = node1)
@@ -67,7 +67,7 @@ info    220      cmdlist: execute service vm:102 set-static-stats maxcpu 1.0 max
 info    220      cmdlist: execute service vm:103 set-static-stats maxcpu 1.0 maxmem 1024
 info    220      cmdlist: execute service vm:104 set-static-stats maxcpu 1.0 maxmem 1024
 info    220      cmdlist: execute service vm:105 set-static-stats maxcpu 1.0 maxmem 1024
-info    240    node1/crm: auto rebalance - migrate vm:106 to node2 (expected change for imbalance from 0.91 to 0.42)
+info    240    node1/crm: auto rebalance - migrate vm:106 to node2 (expected change for imbalance from 0.64 to 0.30)
 info    240    node1/crm: got crm command: migrate vm:106 node2
 info    240    node1/crm: migrate service 'vm:106' to node 'node2'
 info    240    node1/crm: service 'vm:106': state changed from 'started' to 'migrate'  (node = node3, target = node2)
@@ -76,22 +76,4 @@ info    245    node3/lrm: service vm:106 - end migrate to node 'node2'
 info    260    node1/crm: service 'vm:106': state changed from 'migrate' to 'started'  (node = node2)
 info    263    node2/lrm: starting service vm:106
 info    263    node2/lrm: service status vm:106 started
-info    320    node1/crm: auto rebalance - migrate vm:103 to node1 (expected change for imbalance from 0.42 to 0.31)
-info    320    node1/crm: got crm command: migrate vm:103 node1
-info    320    node1/crm: migrate service 'vm:103' to node 'node1'
-info    320    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node1)
-info    323    node2/lrm: service vm:103 - start migrate to node 'node1'
-info    323    node2/lrm: service vm:103 - end migrate to node 'node1'
-info    340    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
-info    341    node1/lrm: starting service vm:103
-info    341    node1/lrm: service status vm:103 started
-info    400    node1/crm: auto rebalance - migrate vm:104 to node1 (expected change for imbalance from 0.31 to 0.20)
-info    400    node1/crm: got crm command: migrate vm:104 node1
-info    400    node1/crm: migrate service 'vm:104' to node 'node1'
-info    400    node1/crm: service 'vm:104': state changed from 'started' to 'migrate'  (node = node2, target = node1)
-info    403    node2/lrm: service vm:104 - start migrate to node 'node1'
-info    403    node2/lrm: service vm:104 - end migrate to node 'node1'
-info    420    node1/crm: service 'vm:104': state changed from 'migrate' to 'started'  (node = node1)
-info    421    node1/lrm: starting service vm:104
-info    421    node1/lrm: service status vm:104 started
 info    820     hardware: exit simulation - done
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH pve-ha-manager 5/7] manager: add load imbalance to status
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
                   ` (3 preceding siblings ...)
  2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  9:20   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 src/PVE/HA/Manager.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bb..ba26fbf 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -285,6 +285,7 @@ sub flush_master_status {
     $ms->{node_status} = $ns->{status};
     $ms->{service_status} = $ss;
     $ms->{timestamp} = $haenv->get_time();
+    $ms->{imbalance} = $self->{online_node_usage}->calculate_node_imbalance();
 
     $haenv->write_manager_status($ms);
 }
-- 
2.47.3






* [PATCH pve-ha-manager 6/7] api: status: add load imbalance to status
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
                   ` (4 preceding siblings ...)
  2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  9:10   ` Daniel Kral
  2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
  2026-04-28  9:21 ` [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Daniel Kral
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

This is a very basic measure to enable users to detect the prevailing
load imbalance in the UI, which atm reveals nothing about it.

imo, enabling users to track how the load imbalance changes over time
(using RRD graphs, for example) should be considered in the long run.

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 src/PVE/API2/HA/Status.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
index 4894f3b..acec78e 100644
--- a/src/PVE/API2/HA/Status.pm
+++ b/src/PVE/API2/HA/Status.pm
@@ -199,7 +199,9 @@ __PACKAGE__->register_method({
             }
             my $datacenter_config = eval { cfs_read_file('datacenter.cfg') } // {};
             if (my $crs = $datacenter_config->{crs}) {
-                $extra_status .= " - $crs->{ha} load CRS"
+                $extra_status .=
+                    " - $crs->{ha} load CRS "
+                    . sprintf("(load imbalance: %.2f", 100 * $status->{imbalance}) . "%)"
                     if $crs->{ha} && $crs->{ha} ne 'basic';
             }
             my $time_str = localtime($status->{timestamp});
-- 
2.47.3






* [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
                   ` (5 preceding siblings ...)
  2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
@ 2026-04-27 13:20 ` Dominik Rusovac
  2026-04-28  8:53   ` Daniel Kral
  2026-04-28  9:21 ` [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Daniel Kral
  7 siblings, 1 reply; 16+ messages in thread
From: Dominik Rusovac @ 2026-04-27 13:20 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
---
 src/PVE/DataCenterConfig.pm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/PVE/DataCenterConfig.pm b/src/PVE/DataCenterConfig.pm
index 6513594..d120017 100644
--- a/src/PVE/DataCenterConfig.pm
+++ b/src/PVE/DataCenterConfig.pm
@@ -44,6 +44,7 @@ EODESC
         type => 'number',
         optional => 1,
         minimum => 0.0,
+        maximum => 1.0,
         default => 0.3,
         requires => 'ha-auto-rebalance',
         description => "The threshold for the cluster node imbalance, which will"
@@ -72,6 +73,7 @@ EODESC
         type => 'number',
         optional => 1,
         minimum => 0.0,
+        maximum => 1.0,
         default => 0.1,
         requires => 'ha-auto-rebalance',
         description => "The minimum relative improvement in cluster node"
-- 
2.47.3






* Re: [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold
  2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
@ 2026-04-28  8:52   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  8:52 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  www/manager6/form/CRSOptions.js | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/www/manager6/form/CRSOptions.js b/www/manager6/form/CRSOptions.js
> index b5476bd5..985eb8cf 100644
> --- a/www/manager6/form/CRSOptions.js
> +++ b/www/manager6/form/CRSOptions.js
> @@ -66,6 +66,7 @@ Ext.define('PVE.form.CRSOptions', {
>                      fieldLabel: gettext('Imbalance Threshold'),
>                      emptyText: '0.3',
>                      minValue: 0.0,
> +                    maxValue: 1.0,
>                      step: 0.01,
>                      bind: {
>                          disabled: '{!enableAutoRebalance.checked}',

Nice!

Could be irritating if users have already set this option to something
greater than 1.0, but as it's a very new feature and an undocumented
setting, this shouldn't affect many users.

Consider this as:

Reviewed-by: Daniel Kral <d.kral@proxmox.com>





* Re: [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values
  2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
@ 2026-04-28  8:52   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  8:52 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---

This patch should have some commentary in its patch notes on why these
values change and why this causes some of the test cases to perform
fewer balancing migrations; the reason should also be reflected in the
patch summary (subject).

AFAICT it's already nice to see here that the selected migrations are
the same, but because of the default imbalance threshold some of the
previously done balancing migrations are cut.

Might be a discussion point whether to lower the default imbalance
threshold to roughly the mapped value, or whether it still is a good
default for most systems, but that needs more evaluation and is outside
the scope of this patch.





* Re: [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options
  2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
@ 2026-04-28  8:53   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  8:53 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  src/PVE/DataCenterConfig.pm | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/src/PVE/DataCenterConfig.pm b/src/PVE/DataCenterConfig.pm
> index 6513594..d120017 100644
> --- a/src/PVE/DataCenterConfig.pm
> +++ b/src/PVE/DataCenterConfig.pm
> @@ -44,6 +44,7 @@ EODESC
>          type => 'number',
>          optional => 1,
>          minimum => 0.0,
> +        maximum => 1.0,
>          default => 0.3,
>          requires => 'ha-auto-rebalance',
>          description => "The threshold for the cluster node imbalance, which will"
> @@ -72,6 +73,7 @@ EODESC
>          type => 'number',
>          optional => 1,
>          minimum => 0.0,
> +        maximum => 1.0,

Oh right, this should have already been there before as reducing the
imbalance by more than 100 % makes no sense ;-).

>          default => 0.1,
>          requires => 'ha-auto-rebalance',
>          description => "The minimum relative improvement in cluster node"

nit: would be nice to have some patch message here too explaining why
those changes are fine, to make it a little more explicit, and noting
that the new maxima sync with what is already done in the web interface
AFAICT.





* Re: [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values
  2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
@ 2026-04-28  8:53   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  8:53 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  proxmox-resource-scheduling/tests/scheduler.rs | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

This patch should be squashed into the previous one to not break the
build and also make it a little easier to follow why the imbalance
values have changed.





* Re: [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value to unit interval
  2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
@ 2026-04-28  9:05   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  9:05 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> The currently used load imbalance value is given as the so-called
> coefficient of variation (CV), a value that may exceed 1. As such, the
> CV value alone lacks meaning. A CV value of 0.0 means no imbalance, but
> what does a value of, say, 1.7 mean?
>
> Relative to the number of nodes in a cluster, it is possible to
> determine the upper bound of the CV value [0][1]. By dividing the CV
> value by its upper bound, the load imbalance can be represented as a
> value that varies between 0 and 1. Expressing the CV as a percentage
> makes the concept of load imbalance easier to interpret.

Nice, thanks for the work!

Will test the changes over the week, but just from the better
readability / interpretability of the imbalance value this should make
it more user-friendly overall.

>
> [0] https://repositorio.ipbeja.pt/server/api/core/bitstreams/8ed9a444-dbe0-402f-9d2f-90c5bf6e418c/content
> [1] https://stats.stackexchange.com/questions/18621/maximum-value-of-coefficient-of-variation-for-bounded-data-set

and a good read overall, thanks!

A note above Example 1, together with Proposition 13 from the first
paper [0], is interesting here:

    All these properties refer to the case where a single sample is
    considered, however, as noted by [16], Dodd’s corrected coefficient
    of variation, CV corr is not suitable for comparative purpose, as
    can be seen in the next example.

    Example 1. [...]

    [...]

    Proposition 13. Dodd’s corrected coefficient of variation, CV corr,
    is sample-size sensitive.

AFAICT this should not be a problem for us when comparing these values
to make decisions, since we only compare values for the same cluster
size configuration, so this shouldn't affect us badly.


On another note, this should also make it easier for users to set an
imbalance threshold value that is (at least roughly) invariant to the
size of the cluster. If one or more nodes have failed or are in
maintenance mode, this dramatically changes the decisions that the load
balancer can make anyway, but I wonder how much difference it makes to
the sensitivity with which the load balancing system is triggered.

>
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  proxmox-resource-scheduling/src/scheduler.rs | 33 +++++++++++++-------
>  1 file changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/proxmox-resource-scheduling/src/scheduler.rs b/proxmox-resource-scheduling/src/scheduler.rs
> index 49d16f9f..4eacbff9 100644
> --- a/proxmox-resource-scheduling/src/scheduler.rs
> +++ b/proxmox-resource-scheduling/src/scheduler.rs
> @@ -17,17 +17,23 @@ pub struct NodeUsage {
>      pub stats: NodeStats,
>  }
>  
> -/// Returns the load imbalance among the nodes.
> +/// Returns the load imbalance among the nodes, which is a value between 0 and 1 that describes the
> +/// statistical dispersion of the individual node loads around the mean node load. The lower the
> +/// value, the better.
>  ///
> -/// The load balance is measured as the statistical dispersion of the individual node loads.
> -///
> -/// The current implementation uses the dimensionless coefficient of variation, which expresses the
> -/// standard deviation in relation to the average mean of the node loads.
> -///
> -/// The coefficient of variation is not robust, which is a desired property here, because outliers
> -/// should be detected as much as possible.
> +/// In more detail, the current implementation computes the so-called coefficient of variation (CV),
> +/// which is the ratio of the standard deviation to the mean of the given node loads. The lower
> +/// bound of the CV is reached if all node loads are equal. The upper bound is reached if all nodes
> +/// except one are idle. To present the CV as a value between 0 and 1, it's being divided by the
> +/// upper bound of the CV for the given number of nodes.
>  fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) -> f64) -> f64 {
> -    let node_count = nodes.len();
> +    let node_count = nodes.len() as f64;

even though this reduces the number of `as f64` casts below, the node
count is by its nature a positive integer, so it should stay that way.

> +
> +    // imbalance is perfect for less than 2 nodes
> +    if node_count < 2.0 {
> +        return 0.0;
> +    }

this could replace the `load_sum == 0.0` check below, which also makes
sure that we never divide by zero (and return a NaN as a result); the
assignments for node_loads and load_sum could then move into the else
branch's code path.

A comment could make it more explicit that this avoids dividing by zero.

> +
>      let node_loads = nodes.iter().map(to_load).collect::<Vec<_>>();
>  
>      let load_sum = node_loads.iter().sum::<f64>();
> @@ -36,14 +42,17 @@ fn calculate_node_imbalance(nodes: &[NodeUsage], to_load: impl Fn(&NodeUsage) ->
>      if load_sum == 0.0 {
>          0.0
>      } else {
> -        let load_mean = load_sum / node_count as f64;
> +        let load_mean = load_sum / node_count;
>  
>          let squared_diff_sum = node_loads
>              .iter()
>              .fold(0.0, |sum, node_load| sum + (node_load - load_mean).powi(2));
> -        let load_sd = (squared_diff_sum / node_count as f64).sqrt();
> +        let load_sd = (squared_diff_sum / node_count).sqrt();
> +
> +        let max_cv = (node_count - 1.0).sqrt();
> +        let cv = load_sd / load_mean;

nit: just for aesthetics, could be reordered to cv and then max_cv

also, to not lose the reference from the patch message in future
changes, it would be nice to add a comment here on why the calculation
for max_cv is correct.

>  
> -        load_sd / load_mean
> +        cv / max_cv
>      }
>  }
>  
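For reference, the normalization the patch describes can be sketched as
a standalone function like this (plain Rust, independent of the crate's
`NodeUsage` type; the function name and signature are mine, not the
crate's API):

```rust
/// Normalized load imbalance: the coefficient of variation (sd / mean)
/// divided by its upper bound sqrt(n - 1), yielding a value in [0, 1].
/// Standalone sketch, not the crate's actual implementation.
fn normalized_imbalance(loads: &[f64]) -> f64 {
    let n = loads.len();
    // imbalance is perfect for fewer than 2 nodes
    if n < 2 {
        return 0.0;
    }
    let sum: f64 = loads.iter().sum();
    // all nodes idle: no imbalance, and avoids a 0/0 division below
    if sum == 0.0 {
        return 0.0;
    }
    let mean = sum / n as f64;
    let variance = loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n as f64;
    let cv = variance.sqrt() / mean;
    // upper bound of the CV, reached when all load sits on one node
    let max_cv = ((n - 1) as f64).sqrt();
    cv / max_cv
}

fn main() {
    // perfectly balanced cluster
    assert_eq!(normalized_imbalance(&[1.0, 1.0, 1.0]), 0.0);
    // worst case: all load on one of three nodes lands exactly at 1.0
    assert!((normalized_imbalance(&[3.0, 0.0, 0.0]) - 1.0).abs() < 1e-12);
}
```

The worst case hitting exactly 1.0 is the point of the normalization:
the old CV would report sqrt(2) (~1.41) for the same three-node cluster.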






* Re: [PATCH pve-ha-manager 6/7] api: status: add load imbalance to status
  2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
@ 2026-04-28  9:10   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  9:10 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> This is a very basic measure to enable users to detect the prevailing
> load imbalance in the UI, which atm reveals nothing about it.
>
> imo, enabling users to track how the load imbalance changes over time
> (using RRD graphs, for example) should be considered in the long run.
>
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  src/PVE/API2/HA/Status.pm | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
> index 4894f3b..acec78e 100644
> --- a/src/PVE/API2/HA/Status.pm
> +++ b/src/PVE/API2/HA/Status.pm
> @@ -199,7 +199,9 @@ __PACKAGE__->register_method({
>              }
>              my $datacenter_config = eval { cfs_read_file('datacenter.cfg') } // {};
>              if (my $crs = $datacenter_config->{crs}) {
> -                $extra_status .= " - $crs->{ha} load CRS"
> +                $extra_status .=
> +                    " - $crs->{ha} load CRS "
> +                    . sprintf("(load imbalance: %.2f", 100 * $status->{imbalance}) . "%)"
>                      if $crs->{ha} && $crs->{ha} ne 'basic';

I think this should also check whether the load balancing system is
enabled, to not clutter the status string when it's unused and no
action would be taken even if this value is high, but no hard feelings.

>              }
>              my $time_str = localtime($status->{timestamp});






* Re: [PATCH pve-ha-manager 5/7] manager: add load imbalance to status
  2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
@ 2026-04-28  9:20   ` Daniel Kral
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  9:20 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> Signed-off-by: Dominik Rusovac <d.rusovac@proxmox.com>
> ---
>  src/PVE/HA/Manager.pm | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index b69a6bb..ba26fbf 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -285,6 +285,7 @@ sub flush_master_status {
>      $ms->{node_status} = $ns->{status};
>      $ms->{service_status} = $ss;
>      $ms->{timestamp} = $haenv->get_time();
> +    $ms->{imbalance} = $self->{online_node_usage}->calculate_node_imbalance();

Nice! This should give a better look at the current state of the
load balancer and how well it performs w.r.t. the imbalance
threshold.

>  
>      $haenv->write_manager_status($ms);
>  }

Consider as:

Reviewed-by: Daniel Kral <d.kral@proxmox.com>





* Re: [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval
  2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
                   ` (6 preceding siblings ...)
  2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
@ 2026-04-28  9:21 ` Daniel Kral
  7 siblings, 0 replies; 16+ messages in thread
From: Daniel Kral @ 2026-04-28  9:21 UTC (permalink / raw)
  To: Dominik Rusovac, pve-devel

On Mon Apr 27, 2026 at 3:20 PM CEST, Dominik Rusovac wrote:
> # TL;DR 
> clamp load imbalance to value between 0 and 1, and display the value as
> percentage in HA Status panel of PVE UI. 
>
> # Details
> The currently used load imbalance value is given as the so-called coefficient of
> variation (CV), a value that may exceed 1. As such, the CV value alone lacks
> meaning. A CV value of 0.0 means no imbalance, but what does a value of, say,
> 1.7 mean?
>
> Relative to the number of nodes in a cluster, it is possible to determine the
> upper bound of the CV value [0][1]. By dividing the CV value by its upper
> bound, the load imbalance can be represented as a value that varies between 0
> and 1. Expressing the CV as a percentage makes the concept of load imbalance
> easier to interpret.
>
> # Summary of Changes
> This series:
> - represents load imbalance as a value between 0 and 1;
> - adds a maximum value of 1.0 for load scheduler options; and
> - integrates the load imbalance value within the HA status endpoint; 
>   this is to provide feedback on the prevailing load imbalance in the PVE UI.

As discussed off-list, it would be interesting to also keep a history of
the imbalance value for the cluster.

In that discussion we also wondered whether we could derive that history
without changing the rrdcached schema at all by fetching the
average/maximum values for the already pre-defined time frames (each
minute, each hour, etc.) and use the same calculate_node_imbalance(),
but just for the raw values.

Haven't checked how much error this introduces, since the rrdcached
values differ from the sampled values fetched from the rrddump in the
HA Manager simply because they are averaged out, but it would be
interesting to see whether the introduced error is negligible.





Thread overview: 16+ messages
2026-04-27 13:20 [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Dominik Rusovac
2026-04-27 13:20 ` [PATCH proxmox 1/7] resource-scheduling: clamp imbalance value " Dominik Rusovac
2026-04-28  9:05   ` Daniel Kral
2026-04-27 13:20 ` [PATCH proxmox 2/7] resource-scheduling: re-adjust hardcoded imbalance values Dominik Rusovac
2026-04-28  8:53   ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-manager 3/7] ui: form/CRSOptions: add maximum for threshold Dominik Rusovac
2026-04-28  8:52   ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 4/7] test: re-adjust logged imbalance values Dominik Rusovac
2026-04-28  8:52   ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 5/7] manager: add load imbalance to status Dominik Rusovac
2026-04-28  9:20   ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-ha-manager 6/7] api: status: " Dominik Rusovac
2026-04-28  9:10   ` Daniel Kral
2026-04-27 13:20 ` [PATCH pve-cluster 7/7] datacenter config: add maxima for load scheduler options Dominik Rusovac
2026-04-28  8:53   ` Daniel Kral
2026-04-28  9:21 ` [RFC PATCH-SERIES cluster/ha-manager/manager/proxmox 0/7] clamp load imbalance to unit interval Daniel Kral
