From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer
Date: Tue, 17 Feb 2026 15:13:54 +0100
Message-ID: <20260217141437.584852-1-d.kral@proxmox.com>

This RFC series proposes a dynamic scheduler and a manual/automatic static/dynamic load rebalancer. It implements the following:

- make the basic and static scheduler acknowledge the active usage of running, non-HA resources when making decisions,
- gather dynamic node and service usage information and use it in the dynamic scheduler, and
- implement a load rebalancer, which actively moves HA resources to other nodes to lower the overall cluster node imbalance while adhering to the HA rules.

== Model ==

The automatic load rebalancing system checks whether the cluster node imbalance exceeds a user-defined threshold for a number of consecutive HA Manager rounds (the "hold duration"). If it does, it chooses the service migration/relocation that improves the cluster node imbalance the most and queues it, provided the improvement exceeds a user-defined minimum (the "margin"). The best service motion can be selected either by brute force or by TOPSIS.
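The following is a minimal sketch of that trigger logic, not the code from this series: it assumes, purely for illustration, that the node imbalance is the coefficient of variation of the per-node load (which happens to match the 0.5 figure in the TODO example below), and the names (imbalance, RebalanceState, hold_rounds, ...) are made up for the sketch.

    // Minimal sketch of the rebalancing trigger, not the series' actual code.
    // Assumed metric: population standard deviation of the node loads divided
    // by their mean, e.g. loads [0.9, 0.9, 0.9, 0.9, 0.0] give 0.36 / 0.72 = 0.5.
    fn imbalance(node_loads: &[f64]) -> f64 {
        let n = node_loads.len() as f64;
        let mean = node_loads.iter().sum::<f64>() / n;
        if mean == 0.0 {
            return 0.0;
        }
        let variance = node_loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
        variance.sqrt() / mean
    }

    // Hypothetical per-round check: rebalance only after the imbalance has been
    // above `threshold` for `hold_rounds` consecutive HA Manager rounds, and
    // queue the best candidate motion only if it improves the imbalance by at
    // least `margin`.
    struct RebalanceState {
        rounds_above_threshold: u32,
    }

    impl RebalanceState {
        fn on_manager_round(
            &mut self,
            node_loads: &[f64],
            threshold: f64,
            hold_rounds: u32,
            margin: f64,
            best_candidate_imbalance: f64, // imbalance after the best candidate motion
        ) -> bool {
            let current = imbalance(node_loads);
            if current <= threshold {
                self.rounds_above_threshold = 0;
                return false;
            }
            self.rounds_above_threshold += 1;
            if self.rounds_above_threshold < hold_rounds {
                return false;
            }
            current - best_candidate_imbalance >= margin
        }
    }

    fn main() {
        let mut state = RebalanceState { rounds_above_threshold: 0 };
        let loads = [0.9, 0.9, 0.9, 0.9, 0.0];
        println!("imbalance = {:.2}", imbalance(&loads)); // prints 0.50
        let queue = state.on_manager_round(&loads, 0.3, 1, 0.05, 0.2);
        println!("queue motion: {queue}");
    }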
The selection method (brute force or TOPSIS) and some of the other parameters above can be tweaked at runtime in this RFC revision, but they will likely be reduced to a minimum in a final revision to allow further improvements in the future without pinning us to a specific model in the background.

== Tests ==

I've added some rudimentary test cases to document the more basic decisions. Otherwise, I've done tests in homogeneous and heterogeneous virtualized clusters, adding load to guests dynamically with stress-ng, and plan to rely more on real-world load simulators for the next batch of tests.

The repositories were all tested individually with their derivations of `git rebase master --exec 'make clean && make deb'`, and the old pve-rs shared library can still be built with the new proxmox-resource-scheduling package.

Otherwise, feedback from a variety of different use cases would be very welcome. In this RFC version, the pve-ha-crm service reports the node imbalance to the syslog every ~10 seconds to make it easier to keep an eye on it.

== Benchmarks ==

I've also done some theoretical benchmarks, targeting a 48-node cluster with 9,999 HA resources / guests and a worst-case scenario of each HA resource being part of 3 HA rules (pairwise positive and negative resource affinity rules, where each positive resource affinity pair has a common node affinity rule).

Generating the migration candidates for the huge cluster with the worst-case HA ruleset takes 243 +- 9 ms. Generating the migration candidates for the huge cluster without the worst-case HA ruleset (to get the maximum number of 459954 migration candidates) takes 356 +- 6 ms. This is expected, because more HA resources need to be evaluated individually when there are no HA resource bundles.

Excluding the generation, the brute force and TOPSIS methods for select_best_balancing_migration() were roughly similar, both being in the range of 350 +- 50 ms for the huge cluster without any HA rules (i.e. for the maximum number of migration candidates), including the serialization between Perl and Rust.
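For readers unfamiliar with TOPSIS, here is a generic, self-contained sketch of how such a ranking works. It is not the implementation from this series; the criteria, weights, and names in it are made up for illustration only.

    // Generic TOPSIS ranking sketch (not the series' implementation). Each
    // migration candidate is scored on a few made-up criteria, e.g. the
    // resulting cluster imbalance (cost) and the relief on the busiest node
    // (benefit).

    /// Rank alternatives (rows) over criteria (columns); returns indices
    /// sorted from best to worst by TOPSIS closeness coefficient.
    fn topsis_rank(matrix: &[Vec<f64>], weights: &[f64], benefit: &[bool]) -> Vec<usize> {
        let rows = matrix.len();
        let cols = weights.len();

        // 1. vector-normalize and weight each column
        let mut v = vec![vec![0.0; cols]; rows];
        for j in 0..cols {
            let norm = matrix.iter().map(|r| r[j] * r[j]).sum::<f64>().sqrt();
            for i in 0..rows {
                v[i][j] = if norm > 0.0 { weights[j] * matrix[i][j] / norm } else { 0.0 };
            }
        }

        // 2. ideal best / worst value per criterion
        let mut best = vec![0.0; cols];
        let mut worst = vec![0.0; cols];
        for j in 0..cols {
            let col: Vec<f64> = (0..rows).map(|i| v[i][j]).collect();
            let max = col.iter().cloned().fold(f64::MIN, f64::max);
            let min = col.iter().cloned().fold(f64::MAX, f64::min);
            if benefit[j] { best[j] = max; worst[j] = min; } else { best[j] = min; worst[j] = max; }
        }

        // 3. closeness = d_worst / (d_best + d_worst), higher is better
        let mut scored: Vec<(usize, f64)> = (0..rows)
            .map(|i| {
                let d_best = (0..cols).map(|j| (v[i][j] - best[j]).powi(2)).sum::<f64>().sqrt();
                let d_worst = (0..cols).map(|j| (v[i][j] - worst[j]).powi(2)).sum::<f64>().sqrt();
                (i, if d_best + d_worst > 0.0 { d_worst / (d_best + d_worst) } else { 0.0 })
            })
            .collect();
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.into_iter().map(|(i, _)| i).collect()
    }

    fn main() {
        // columns: resulting imbalance (cost), relief on the busiest node (benefit)
        let candidates = vec![
            vec![0.30, 0.10], // candidate 0
            vec![0.20, 0.25], // candidate 1
            vec![0.25, 0.20], // candidate 2
        ];
        let ranking = topsis_rank(&candidates, &[0.5, 0.5], &[false, true]);
        println!("ranking (best first): {ranking:?}"); // candidate 1 ranks first
    }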
== TODO ==

- as always, more test cases (high priority)
- fix rebalancing on service start for dynamic mode with yet-to-be-started HA resources, see known issues (high priority)
- decide whether to use brute force or TOPSIS to choose a service migration when rebalancing; currently both methods are included in the implementation, but more practical tests will show which is ideal (high priority)
- assess whether an additional individual node load threshold is needed to trigger the load balancing when extreme outlier cases are not detected (e.g. 4 nodes at 90% load and 1 node at 0% load would need an imbalance threshold of 0.5 to be detected) (high priority)
- allow individual HA resources to be actively excluded from the automatic rebalancing, e.g. because containers cannot be live-migrated (medium priority)
- move the migration candidate generation to the Rust side; the generation on the Perl side was chosen first to reduce code duplication, but it doesn't seem future-proof or right to copy state to the online_node_usage object twice (medium priority)
- user/admin documentation (medium priority)
- factor out the common hashmap build-ups from the static and dynamic load scheduler (low priority)
- probably move these to the proxmox-resource-scheduling crate to make the perlmod bindings wrapper thinner (low priority)

== Future ideas ==

- include the migration costs in score_best_balancing_migrations(), e.g. so that VMs with lots of memory are less likely to be migrated if the link between the nodes is slow; but that would need measuring and storing the migration network link speeds as a mesh
- apply a filter like a moving average window or exponential smoothing to the usage time series to dampen spikes; triple exponential smoothing (Holt-Winters) is already implemented in rrdcached and allows exponential smoothing with better time series analysis, but would require changing the rrdcached data structure once more
- score_best_balancing_migrations(...) can already provide a size-limited list of the best migrations, which could be exposed to users to allow manual load balancing actions, e.g. from the web interface, and to give some insight into the system
- the current scheduler can only solve bin covering, but it would be interesting to also allow bin packing if certain criteria are met, e.g. for energy preservation while the overall cluster load is low

== Known issues / Discussion ==

(1) score_nodes_to_start_service() in Dynamic Mode

Since we derive the node's usage exclusively from the rrddump for the dynamic mode, a HA resource that is scheduled to be started does not add any usage to its assigned node(s). This ensures that we always work with the actual runtime values and that decisions are not skewed by the hypothetical resource commitment values from the HA resource guest's config.

This has the side effect that score_nodes_to_start_service(...) won't account for other already scheduled HA resources. This differs from what the basic and static load schedulers do; let me illustrate with an example:

...

a) Basic / Static Mode

A 3-node cluster with no HA resources running yet:

1. vm:100 is on node1 and in state 'request_start'
2. 'ha-rebalance-on-start' is set, so check for the best starting node with select_service_node(..., 'best-score'); if the cluster is homogeneous, score_nodes_to_start_service(...) will score all nodes equally, therefore vm:100 stays on node1
3. vm:100 is set to the state 'started'; this immediately adds vm:100's maxcpu and maxmem to node1's usage, even though vm:100 has not been started by node1's LRM yet
4. Next, vm:101 is also on node1 and in state 'request_start'
5. Same procedure, but now score_nodes_to_start_service(...) will score node1 worse than node2 and node3, because node1 already has vm:100's usage added, and will choose node2

b) Dynamic Mode

Same setup and same actions, with these steps differing:

3. vm:100 is set to the state 'started'; but as vm:100 is not actually started yet, it is not accounted for in node1's usage yet
5. score_nodes_to_start_service(...) will also score all nodes equally, because the node usages haven't actually changed; it will also choose node1 as the starting node

...

There are multiple solutions to this:

1. Leave it as is: the actual usage is only known while running, and the automatic rebalancer will do its job, at the cost of live-migrating the guests afterwards (worse)
2. Treat HA resources that are scheduled to start but not started yet as special and add them to a hypothetical node usage, which is only used for score_nodes_to_start_service(...)
3. Record a characteristic load known from previous runs of that HA resource and use that as a weight when scoring the starting node
N. ...

Solution 2 seems like the most reasonable to implement at the moment, even though it might not be perfect.
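As a rough illustration of solution 2, here is a minimal sketch of a "pending" usage overlay that is consulted only when scoring a start node; the types and names (ClusterUsage, add_pending, start_score, ...) are made up for this sketch and are not part of the series.

    use std::collections::HashMap;

    // Hypothetical overlay for solution 2: resources in 'request_start'
    // contribute their static (config-derived) stats to a per-node "pending"
    // usage that is only consulted when scoring a start node, while the
    // measured dynamic usage stays untouched.

    #[derive(Default, Clone, Copy)]
    struct Usage {
        cpu: f64, // fraction of a core
        mem: u64, // bytes
    }

    #[derive(Default)]
    struct NodeUsage {
        measured: Usage, // from the rrddump / dynamic stats
        pending: Usage,  // scheduled but not yet running HA resources
    }

    #[derive(Default)]
    struct ClusterUsage {
        nodes: HashMap<String, NodeUsage>,
    }

    impl ClusterUsage {
        /// Called when a resource enters 'request_start' on `node`;
        /// `static_stats` would come from the guest config (maxcpu/maxmem),
        /// as in the static mode.
        fn add_pending(&mut self, node: &str, static_stats: Usage) {
            let entry = self.nodes.entry(node.to_string()).or_default();
            entry.pending.cpu += static_stats.cpu;
            entry.pending.mem += static_stats.mem;
        }

        /// Score used only when picking a start node: measured + pending, so
        /// consecutive start requests do not all pick the same node.
        fn start_score(&self, node: &str) -> f64 {
            self.nodes
                .get(node)
                .map(|n| {
                    let cpu = n.measured.cpu + n.pending.cpu;
                    let mem = (n.measured.mem + n.pending.mem) as f64;
                    cpu + mem / (64.0 * 1024.0 * 1024.0 * 1024.0) // naive weighting for the sketch
                })
                .unwrap_or(0.0)
        }
    }

    fn main() {
        let mut usage = ClusterUsage::default();
        usage.add_pending("node1", Usage { cpu: 2.0, mem: 4 << 30 });
        println!("node1 start score: {:.3}", usage.start_score("node1"));
        println!("node2 start score: {:.3}", usage.start_score("node2"));
    }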
== Diffstat ==

proxmox:

Daniel Kral (5):
  resource-scheduling: move score_nodes_to_start_service to scheduler crate
  resource-scheduling: introduce generic cluster usage implementation
  resource-scheduling: add dynamic node and service stats
  resource-scheduling: implement rebalancing migration selection
  resource-scheduling: implement Add and Default for {Dynamic,Static}ServiceStats

 proxmox-resource-scheduling/src/lib.rs        |   3 +
 .../src/pve_dynamic.rs                        |  68 +++
 proxmox-resource-scheduling/src/pve_static.rs | 110 ++---
 proxmox-resource-scheduling/src/scheduler.rs  | 436 ++++++++++++++++++
 4 files changed, 552 insertions(+), 65 deletions(-)
 create mode 100644 proxmox-resource-scheduling/src/pve_dynamic.rs
 create mode 100644 proxmox-resource-scheduling/src/scheduler.rs

base-commit: 984affa2c9149710d4f832c7522ddd3eb8802000
prerequisite-patch-id: 6b73d4dd683cbad857fbd527e1f2d53a709f2eef
prerequisite-patch-id: df7b2e2d7d42a588b7ecc18f475967ecdd0f25ef
prerequisite-patch-id: cffcc847e63bae6ce133507882e010da234df429
prerequisite-patch-id: a4699e1ba53fe9003b9b70f11eae50e924516fa1
prerequisite-patch-id: cf1be5cf0be66fc1388be3a269be5b10290d643d
prerequisite-patch-id: 3802a06b0b641438d949480479f83b8bd5000a0f
prerequisite-patch-id: 2890c6a14768e3fdfd3f35daccbe1953b0da199c
prerequisite-patch-id: 54c4904ef279bc1e64fe4904f862f690beb2d221
prerequisite-patch-id: a863374a7d7b814e74b1cacef3d1f7ed7428e45c
prerequisite-patch-id: 9a9452055bc5b1e3cfabadbb72567afe721b5856
prerequisite-patch-id: a4437f7be281e8efbf9628272ad41f9de2586994
prerequisite-patch-id: 14555958f0c6c1f3b6fae29f998139c900a97162

perl-rs:

Daniel Kral (6):
  pve-rs: resource scheduling: use generic cluster usage implementation
  pve-rs: resource scheduling: create service_nodes hashset from array
  pve-rs: resource scheduling: store service stats independently of node
  pve-rs: resource scheduling: expose auto rebalancing methods
  pve-rs: resource scheduling: move pve_static into resource_scheduling module
  pve-rs: resource scheduling: implement pve_dynamic bindings

 pve-rs/Makefile                               |   1 +
 pve-rs/src/bindings/mod.rs                    |   3 +-
 .../src/bindings/resource_scheduling/mod.rs   |  20 +
 .../resource_scheduling/pve_dynamic.rs        | 349 +++++++++++++++++
 .../resource_scheduling/pve_static.rs         | 365 ++++++++++++++++++
 .../bindings/resource_scheduling_static.rs    | 215 -----------
 pve-rs/test/resource_scheduling.pl            |   1 +
 7 files changed, 737 insertions(+), 217 deletions(-)
 create mode 100644 pve-rs/src/bindings/resource_scheduling/mod.rs
 create mode 100644 pve-rs/src/bindings/resource_scheduling/pve_dynamic.rs
 create mode 100644 pve-rs/src/bindings/resource_scheduling/pve_static.rs
 delete mode 100644 pve-rs/src/bindings/resource_scheduling_static.rs

cluster:

Daniel Kral (2):
  datacenter config: add dynamic load scheduler option
  datacenter config: add auto rebalancing options

 src/PVE/DataCenterConfig.pm | 43 +++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

base-commit: 300978f19f91e3e226a80bb69ba6a21ec279e869

ha-manager:

Daniel Kral (17):
  rename static node stats to be consistent with similar interfaces
  resources: remove redundant load_config fallback for static config
  remove redundant service_node and migration_target parameter
  factor out common pve to ha resource type mapping
  derive static service stats while filling the service stats repository
  test: make static service usage explicit for all resources
  make static service stats indexable by sid
  move static service stats repository to PVE::HA::Usage::Static
  usage: augment service stats with node and state information
  include running non-HA resources in the scheduler's accounting
  env, resources: add dynamic node and service stats abstraction
  env: pve2: implement dynamic node and service stats
  usage: add dynamic usage scheduler
  manager: rename execute_migration to queue_resource_motion
  manager: update_crs_scheduler_mode: factor out crs config
  implement automatic rebalancing
  test: add basic automatic rebalancing system test cases

Dominik Rusovac (4):
  sim: hardware: pass correct types for static stats
  sim: hardware: factor out static stats' default values
  sim: hardware: rewrite set-static-stats
  sim: hardware: add set-dynamic-stats for services

 debian/pve-ha-manager.install                 |   1 +
 src/PVE/HA/Config.pm                          |   8 +-
 src/PVE/HA/Env.pm                             |  28 ++-
 src/PVE/HA/Env/PVE2.pm                        | 126 ++++++++--
 src/PVE/HA/Manager.pm                         | 208 ++++++++++++++-
 src/PVE/HA/Resources.pm                       |   4 +-
 src/PVE/HA/Resources/PVECT.pm                 |   7 +-
 src/PVE/HA/Resources/PVEVM.pm                 |   7 +-
 src/PVE/HA/Sim/Env.pm                         |  28 ++-
 src/PVE/HA/Sim/Hardware.pm                    | 236 ++++++++++++++++--
 src/PVE/HA/Sim/RTHardware.pm                  |   3 +-
 src/PVE/HA/Sim/Resources.pm                   |  17 --
 src/PVE/HA/Tools.pm                           |  23 +-
 src/PVE/HA/Usage.pm                           |  48 +++-
 src/PVE/HA/Usage/Basic.pm                     |   6 +-
 src/PVE/HA/Usage/Dynamic.pm                   | 160 ++++++++++++
 src/PVE/HA/Usage/Makefile                     |   2 +-
 src/PVE/HA/Usage/Static.pm                    | 101 +++++---
 .../test-crs-dynamic-auto-rebalance0/README   |   2 +
 .../test-crs-dynamic-auto-rebalance0/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   1 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  11 +
 .../manager_status                            |   1 +
 .../service_config                            |   1 +
 .../static_service_stats                      |   1 +
 .../test-crs-dynamic-auto-rebalance1/README   |   6 +
 .../test-crs-dynamic-auto-rebalance1/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   3 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  25 ++
 .../manager_status                            |   1 +
 .../service_config                            |   3 +
 .../static_service_stats                      |   3 +
 .../test-crs-dynamic-auto-rebalance2/README   |   3 +
 .../test-crs-dynamic-auto-rebalance2/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   6 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  59 +++++
 .../manager_status                            |   1 +
 .../service_config                            |   6 +
 .../static_service_stats                      |   6 +
 .../test-crs-dynamic-auto-rebalance3/README   |   3 +
 .../test-crs-dynamic-auto-rebalance3/cmdlist  |  24 ++
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   9 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  88 +++++++
 .../manager_status                            |   1 +
 .../service_config                            |   9 +
 .../static_service_stats                      |   9 +
 .../hardware_status                           |   6 +-
 .../hardware_status                           |   6 +-
 .../hardware_status                           |  10 +-
 .../hardware_status                           |   6 +-
 .../static_service_stats                      |  52 +++-
 .../hardware_status                           |   6 +-
 .../static_service_stats                      |   9 +-
 src/test/test-crs-static1/hardware_status     |   6 +-
 src/test/test-crs-static2/hardware_status     |  10 +-
 src/test/test-crs-static3/hardware_status     |   6 +-
 src/test/test-crs-static4/hardware_status     |   6 +-
 src/test/test-crs-static5/hardware_status     |   6 +-
 66 files changed, 1290 insertions(+), 195 deletions(-)
 create mode 100644 src/PVE/HA/Usage/Dynamic.pm
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/static_service_stats

manager:

Daniel Kral (2):
  ui: dc/options: add dynamic load scheduler option
  ui: dc/options: add auto rebalancing options

 www/manager6/dc/OptionView.js | 46 +++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

base-commit: 71482d1833ded40a25a78b67f09cc1975acf92c9
prerequisite-patch-id: 25cc6a017d5278d73a77510dfa90379cef4d66b1

Summary over all repositories:
  79 files changed, 2666 insertions(+), 479 deletions(-)

-- 
Generated by murpp 0.9.0