From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer
Date: Tue, 17 Feb 2026 15:13:54 +0100
Message-ID: <20260217141437.584852-1-d.kral@proxmox.com>

This RFC series proposes a dynamic scheduler and a manual/automatic static/dynamic load rebalancer. It implements the following:

- make the basic and static scheduler acknowledge the active usage of running, non-HA resources when making decisions,
- gather dynamic node and service usage information and use it in the dynamic scheduler, and
- implement a load rebalancer, which actively moves HA resources to other nodes to lower the overall cluster node imbalance while adhering to the HA rules.

== Model ==

The automatic load rebalancing system checks whether the cluster node imbalance exceeds a user-defined threshold for a number of consecutive HA Manager rounds (the "hold duration"). If it does, it chooses the service migration/relocation that improves the cluster node imbalance the most and queues it, provided the improvement exceeds a user-defined minimum (the "margin"). The best service motion can be selected either by brute force or by TOPSIS.
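The following is a minimal sketch of that trigger logic, not the code from this series: it assumes, purely for illustration, that the node imbalance is the coefficient of variation of the per-node load (which happens to match the 0.5 figure in the TODO example below), and the names (imbalance, RebalanceState, hold_rounds, ...) are made up for the sketch.

    // Minimal sketch of the rebalancing trigger, not the series' actual code.
    // Assumed metric: population standard deviation of the node loads divided
    // by their mean, e.g. loads [0.9, 0.9, 0.9, 0.9, 0.0] give 0.36 / 0.72 = 0.5.
    fn imbalance(node_loads: &[f64]) -> f64 {
        let n = node_loads.len() as f64;
        let mean = node_loads.iter().sum::<f64>() / n;
        if mean == 0.0 {
            return 0.0;
        }
        let variance = node_loads.iter().map(|l| (l - mean).powi(2)).sum::<f64>() / n;
        variance.sqrt() / mean
    }

    // Hypothetical per-round check: rebalance only after the imbalance has been
    // above `threshold` for `hold_rounds` consecutive HA Manager rounds, and
    // queue the best candidate motion only if it improves the imbalance by at
    // least `margin`.
    struct RebalanceState {
        rounds_above_threshold: u32,
    }

    impl RebalanceState {
        fn on_manager_round(
            &mut self,
            node_loads: &[f64],
            threshold: f64,
            hold_rounds: u32,
            margin: f64,
            best_candidate_imbalance: f64, // imbalance after the best candidate motion
        ) -> bool {
            let current = imbalance(node_loads);
            if current <= threshold {
                self.rounds_above_threshold = 0;
                return false;
            }
            self.rounds_above_threshold += 1;
            if self.rounds_above_threshold < hold_rounds {
                return false;
            }
            current - best_candidate_imbalance >= margin
        }
    }

    fn main() {
        let mut state = RebalanceState { rounds_above_threshold: 0 };
        let loads = [0.9, 0.9, 0.9, 0.9, 0.0];
        println!("imbalance = {:.2}", imbalance(&loads)); // prints 0.50
        let queue = state.on_manager_round(&loads, 0.3, 1, 0.05, 0.2);
        println!("queue motion: {queue}");
    }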
The selection method (brute force or TOPSIS) and some of the other parameters above can be tweaked at runtime in this RFC revision, but they will likely be reduced to a minimum in a final revision to allow further improvements in the future without pinning us to a specific model in the background.

== Tests ==

I've added some rudimentary test cases to document the more basic decisions. Otherwise, I've done tests in homogeneous and heterogeneous virtualized clusters, adding load to guests dynamically with stress-ng, and plan to rely more on real-world load simulators for the next batch of tests.

The repositories were all tested individually with their derivations of `git rebase master --exec 'make clean && make deb'`, and the old pve-rs shared library can still be built with the new proxmox-resource-scheduling package.

Otherwise, feedback from a variety of different use cases would be very welcome. In this RFC version, the pve-ha-crm service reports the node imbalance to the syslog every ~10 seconds to make it easier to keep an eye on it.

== Benchmarks ==

I've also done some theoretical benchmarks, targeting a 48-node cluster with 9,999 HA resources / guests and a worst-case scenario of each HA resource being part of 3 HA rules (pairwise positive and negative resource affinity rules, where each positive resource affinity pair has a common node affinity rule).

Generating the migration candidates for the huge cluster with the worst-case HA ruleset takes 243 +- 9 ms. Generating the migration candidates for the huge cluster without the worst-case HA ruleset (to get the maximum number of 459954 migration candidates) takes 356 +- 6 ms. This is expected, because more HA resources need to be evaluated individually when there are no HA resource bundles.

Excluding the generation, the brute force and TOPSIS methods for select_best_balancing_migration() were roughly similar, both being in the range of 350 +- 50 ms for the huge cluster without any HA rules (i.e. for the maximum number of migration candidates), including the serialization between Perl and Rust.
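For readers unfamiliar with TOPSIS, here is a generic, self-contained sketch of how such a ranking works. It is not the implementation from this series; the criteria, weights, and names in it are made up for illustration only.

    // Generic TOPSIS ranking sketch (not the series' implementation). Each
    // migration candidate is scored on a few made-up criteria, e.g. the
    // resulting cluster imbalance (cost) and the relief on the busiest node
    // (benefit).

    /// Rank alternatives (rows) over criteria (columns); returns indices
    /// sorted from best to worst by TOPSIS closeness coefficient.
    fn topsis_rank(matrix: &[Vec<f64>], weights: &[f64], benefit: &[bool]) -> Vec<usize> {
        let rows = matrix.len();
        let cols = weights.len();

        // 1. vector-normalize and weight each column
        let mut v = vec![vec![0.0; cols]; rows];
        for j in 0..cols {
            let norm = matrix.iter().map(|r| r[j] * r[j]).sum::<f64>().sqrt();
            for i in 0..rows {
                v[i][j] = if norm > 0.0 { weights[j] * matrix[i][j] / norm } else { 0.0 };
            }
        }

        // 2. ideal best / worst value per criterion
        let mut best = vec![0.0; cols];
        let mut worst = vec![0.0; cols];
        for j in 0..cols {
            let col: Vec<f64> = (0..rows).map(|i| v[i][j]).collect();
            let max = col.iter().cloned().fold(f64::MIN, f64::max);
            let min = col.iter().cloned().fold(f64::MAX, f64::min);
            if benefit[j] { best[j] = max; worst[j] = min; } else { best[j] = min; worst[j] = max; }
        }

        // 3. closeness = d_worst / (d_best + d_worst), higher is better
        let mut scored: Vec<(usize, f64)> = (0..rows)
            .map(|i| {
                let d_best = (0..cols).map(|j| (v[i][j] - best[j]).powi(2)).sum::<f64>().sqrt();
                let d_worst = (0..cols).map(|j| (v[i][j] - worst[j]).powi(2)).sum::<f64>().sqrt();
                (i, if d_best + d_worst > 0.0 { d_worst / (d_best + d_worst) } else { 0.0 })
            })
            .collect();
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.into_iter().map(|(i, _)| i).collect()
    }

    fn main() {
        // columns: resulting imbalance (cost), relief on the busiest node (benefit)
        let candidates = vec![
            vec![0.30, 0.10], // candidate 0
            vec![0.20, 0.25], // candidate 1
            vec![0.25, 0.20], // candidate 2
        ];
        let ranking = topsis_rank(&candidates, &[0.5, 0.5], &[false, true]);
        println!("ranking (best first): {ranking:?}"); // candidate 1 ranks first
    }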
== TODO ==

- as always, more test cases (high priority)
- fix rebalancing on service start for dynamic mode with yet-to-be-started HA resources, see known issues (high priority)
- decide whether to use brute force or TOPSIS to choose a service migration when rebalancing; currently both methods are included in the implementation, but more practical tests will show which is ideal (high priority)
- assess whether an additional individual node load threshold is needed to trigger the load balancing when extreme outlier cases are not detected (e.g. 4 nodes at 90% load and 1 node at 0% load would need an imbalance threshold of 0.5 to be detected) (high priority)
- allow individual HA resources to be actively excluded from the automatic rebalancing, e.g. because containers cannot be live-migrated (medium priority)
- move the migration candidate generation to the Rust side; the generation on the Perl side was chosen first to reduce code duplication, but it doesn't seem future-proof or right to copy state to the online_node_usage object twice (medium priority)
- user/admin documentation (medium priority)
- factor out the common hashmap build-ups from the static and dynamic load scheduler (low priority)
- probably move these to the proxmox-resource-scheduling crate to make the perlmod bindings wrapper thinner (low priority)

== Future ideas ==

- include the migration costs in score_best_balancing_migrations(), e.g. so that VMs with lots of memory are less likely to be migrated if the link between the nodes is slow; but that would need measuring and storing the migration network link speeds as a mesh
- apply a filter like a moving average window or exponential smoothing to the usage time series to dampen spikes; triple exponential smoothing (Holt-Winters) is already implemented in rrdcached and allows exponential smoothing with better time series analysis, but would require changing the rrdcached data structure once more
- score_best_balancing_migrations(...) can already provide a size-limited list of the best migrations, which could be exposed to users to allow manual load balancing actions, e.g. from the web interface, and to give some insight into the system
- the current scheduler can only solve bin covering, but it would be interesting to also allow bin packing if certain criteria are met, e.g. for energy preservation while the overall cluster load is low

== Known issues / Discussion ==

(1) score_nodes_to_start_service() in Dynamic Mode

Since we derive the node's usage exclusively from the rrddump for the dynamic mode, a HA resource that is scheduled to be started does not add any usage to its assigned node(s). This ensures that we always work with the actual runtime values and that decisions are not skewed by the hypothetical resource commitment values from the HA resource guest's config.

This has the side effect that score_nodes_to_start_service(...) won't account for other already scheduled HA resources. This differs from what the basic and static load schedulers do; let me illustrate with an example:

...

a) Basic / Static Mode

A 3-node cluster with no HA resources running yet:

1. vm:100 is on node1 and in state 'request_start'
2. 'ha-rebalance-on-start' is set, so check for the best starting node with select_service_node(..., 'best-score'); if the cluster is homogeneous, score_nodes_to_start_service(...) will score all nodes equally, therefore vm:100 stays on node1
3. vm:100 is set to the state 'started'; this immediately adds vm:100's maxcpu and maxmem to node1's usage, even though vm:100 has not been started by node1's LRM yet
4. Next, vm:101 is also on node1 and in state 'request_start'
5. Same procedure, but now score_nodes_to_start_service(...) will score node1 worse than node2 and node3, because node1 already has vm:100's usage added, and will choose node2

b) Dynamic Mode

Same setup and same actions, with these steps differing:

3. vm:100 is set to the state 'started'; but as vm:100 is not actually started yet, it is not accounted for in node1's usage yet
5. score_nodes_to_start_service(...) will also score all nodes equally, because the node usages haven't actually changed; it will also choose node1 as the starting node

...

There are multiple solutions to this:

1. Leave it as is: the actual usage is only known while running, and the automatic rebalancer will do its job, at the cost of live-migrating the guests afterwards (worse)
2. Treat HA resources that are scheduled to start but not started yet as special and add them to a hypothetical node usage, which is only used for score_nodes_to_start_service(...)
3. Record a characteristic load known from previous runs of that HA resource and use that as a weight when scoring the starting node
N. ...

Solution 2 seems like the most reasonable to implement at the moment, even though it might not be perfect.
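As a rough illustration of solution 2, here is a minimal sketch of a "pending" usage overlay that is consulted only when scoring a start node; the types and names (ClusterUsage, add_pending, start_score, ...) are made up for this sketch and are not part of the series.

    use std::collections::HashMap;

    // Hypothetical overlay for solution 2: resources in 'request_start'
    // contribute their static (config-derived) stats to a per-node "pending"
    // usage that is only consulted when scoring a start node, while the
    // measured dynamic usage stays untouched.

    #[derive(Default, Clone, Copy)]
    struct Usage {
        cpu: f64, // fraction of a core
        mem: u64, // bytes
    }

    #[derive(Default)]
    struct NodeUsage {
        measured: Usage, // from the rrddump / dynamic stats
        pending: Usage,  // scheduled but not yet running HA resources
    }

    #[derive(Default)]
    struct ClusterUsage {
        nodes: HashMap<String, NodeUsage>,
    }

    impl ClusterUsage {
        /// Called when a resource enters 'request_start' on `node`;
        /// `static_stats` would come from the guest config (maxcpu/maxmem),
        /// as in the static mode.
        fn add_pending(&mut self, node: &str, static_stats: Usage) {
            let entry = self.nodes.entry(node.to_string()).or_default();
            entry.pending.cpu += static_stats.cpu;
            entry.pending.mem += static_stats.mem;
        }

        /// Score used only when picking a start node: measured + pending, so
        /// consecutive start requests do not all pick the same node.
        fn start_score(&self, node: &str) -> f64 {
            self.nodes
                .get(node)
                .map(|n| {
                    let cpu = n.measured.cpu + n.pending.cpu;
                    let mem = (n.measured.mem + n.pending.mem) as f64;
                    cpu + mem / (64.0 * 1024.0 * 1024.0 * 1024.0) // naive weighting for the sketch
                })
                .unwrap_or(0.0)
        }
    }

    fn main() {
        let mut usage = ClusterUsage::default();
        usage.add_pending("node1", Usage { cpu: 2.0, mem: 4 << 30 });
        println!("node1 start score: {:.3}", usage.start_score("node1"));
        println!("node2 start score: {:.3}", usage.start_score("node2"));
    }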
== Diffstat ==

proxmox:

Daniel Kral (5):
  resource-scheduling: move score_nodes_to_start_service to scheduler crate
  resource-scheduling: introduce generic cluster usage implementation
  resource-scheduling: add dynamic node and service stats
  resource-scheduling: implement rebalancing migration selection
  resource-scheduling: implement Add and Default for {Dynamic,Static}ServiceStats

 proxmox-resource-scheduling/src/lib.rs        |   3 +
 .../src/pve_dynamic.rs                        |  68 +++
 proxmox-resource-scheduling/src/pve_static.rs | 110 ++---
 proxmox-resource-scheduling/src/scheduler.rs  | 436 ++++++++++++++++++
 4 files changed, 552 insertions(+), 65 deletions(-)
 create mode 100644 proxmox-resource-scheduling/src/pve_dynamic.rs
 create mode 100644 proxmox-resource-scheduling/src/scheduler.rs

base-commit: 984affa2c9149710d4f832c7522ddd3eb8802000
prerequisite-patch-id: 6b73d4dd683cbad857fbd527e1f2d53a709f2eef
prerequisite-patch-id: df7b2e2d7d42a588b7ecc18f475967ecdd0f25ef
prerequisite-patch-id: cffcc847e63bae6ce133507882e010da234df429
prerequisite-patch-id: a4699e1ba53fe9003b9b70f11eae50e924516fa1
prerequisite-patch-id: cf1be5cf0be66fc1388be3a269be5b10290d643d
prerequisite-patch-id: 3802a06b0b641438d949480479f83b8bd5000a0f
prerequisite-patch-id: 2890c6a14768e3fdfd3f35daccbe1953b0da199c
prerequisite-patch-id: 54c4904ef279bc1e64fe4904f862f690beb2d221
prerequisite-patch-id: a863374a7d7b814e74b1cacef3d1f7ed7428e45c
prerequisite-patch-id: 9a9452055bc5b1e3cfabadbb72567afe721b5856
prerequisite-patch-id: a4437f7be281e8efbf9628272ad41f9de2586994
prerequisite-patch-id: 14555958f0c6c1f3b6fae29f998139c900a97162

perl-rs:

Daniel Kral (6):
  pve-rs: resource scheduling: use generic cluster usage implementation
  pve-rs: resource scheduling: create service_nodes hashset from array
  pve-rs: resource scheduling: store service stats independently of node
  pve-rs: resource scheduling: expose auto rebalancing methods
  pve-rs: resource scheduling: move pve_static into resource_scheduling module
  pve-rs: resource scheduling: implement pve_dynamic bindings

 pve-rs/Makefile                               |   1 +
 pve-rs/src/bindings/mod.rs                    |   3 +-
 .../src/bindings/resource_scheduling/mod.rs   |  20 +
 .../resource_scheduling/pve_dynamic.rs        | 349 +++++++++++++++++
 .../resource_scheduling/pve_static.rs         | 365 ++++++++++++++++++
 .../bindings/resource_scheduling_static.rs    | 215 -----------
 pve-rs/test/resource_scheduling.pl            |   1 +
 7 files changed, 737 insertions(+), 217 deletions(-)
 create mode 100644 pve-rs/src/bindings/resource_scheduling/mod.rs
 create mode 100644 pve-rs/src/bindings/resource_scheduling/pve_dynamic.rs
 create mode 100644 pve-rs/src/bindings/resource_scheduling/pve_static.rs
 delete mode 100644 pve-rs/src/bindings/resource_scheduling_static.rs

cluster:

Daniel Kral (2):
  datacenter config: add dynamic load scheduler option
  datacenter config: add auto rebalancing options

 src/PVE/DataCenterConfig.pm | 43 +++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

base-commit: 300978f19f91e3e226a80bb69ba6a21ec279e869

ha-manager:

Daniel Kral (17):
  rename static node stats to be consistent with similar interfaces
  resources: remove redundant load_config fallback for static config
  remove redundant service_node and migration_target parameter
  factor out common pve to ha resource type mapping
  derive static service stats while filling the service stats repository
  test: make static service usage explicit for all resources
  make static service stats indexable by sid
  move static service stats repository to PVE::HA::Usage::Static
  usage: augment service stats with node and state information
  include running non-HA resources in the scheduler's accounting
  env, resources: add dynamic node and service stats abstraction
  env: pve2: implement dynamic node and service stats
  usage: add dynamic usage scheduler
  manager: rename execute_migration to queue_resource_motion
  manager: update_crs_scheduler_mode: factor out crs config
  implement automatic rebalancing
  test: add basic automatic rebalancing system test cases

Dominik Rusovac (4):
  sim: hardware: pass correct types for static stats
  sim: hardware: factor out static stats' default values
  sim: hardware: rewrite set-static-stats
  sim: hardware: add set-dynamic-stats for services

 debian/pve-ha-manager.install                 |   1 +
 src/PVE/HA/Config.pm                          |   8 +-
 src/PVE/HA/Env.pm                             |  28 ++-
 src/PVE/HA/Env/PVE2.pm                        | 126 ++++++++--
 src/PVE/HA/Manager.pm                         | 208 ++++++++++++++-
 src/PVE/HA/Resources.pm                       |   4 +-
 src/PVE/HA/Resources/PVECT.pm                 |   7 +-
 src/PVE/HA/Resources/PVEVM.pm                 |   7 +-
 src/PVE/HA/Sim/Env.pm                         |  28 ++-
 src/PVE/HA/Sim/Hardware.pm                    | 236 ++++++++++++++++--
 src/PVE/HA/Sim/RTHardware.pm                  |   3 +-
 src/PVE/HA/Sim/Resources.pm                   |  17 --
 src/PVE/HA/Tools.pm                           |  23 +-
 src/PVE/HA/Usage.pm                           |  48 +++-
 src/PVE/HA/Usage/Basic.pm                     |   6 +-
 src/PVE/HA/Usage/Dynamic.pm                   | 160 ++++++++++++
 src/PVE/HA/Usage/Makefile                     |   2 +-
 src/PVE/HA/Usage/Static.pm                    | 101 +++++---
 .../test-crs-dynamic-auto-rebalance0/README   |   2 +
 .../test-crs-dynamic-auto-rebalance0/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   1 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  11 +
 .../manager_status                            |   1 +
 .../service_config                            |   1 +
 .../static_service_stats                      |   1 +
 .../test-crs-dynamic-auto-rebalance1/README   |   6 +
 .../test-crs-dynamic-auto-rebalance1/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   3 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  25 ++
 .../manager_status                            |   1 +
 .../service_config                            |   3 +
 .../static_service_stats                      |   3 +
 .../test-crs-dynamic-auto-rebalance2/README   |   3 +
 .../test-crs-dynamic-auto-rebalance2/cmdlist  |   3 +
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   6 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  59 +++++
 .../manager_status                            |   1 +
 .../service_config                            |   6 +
 .../static_service_stats                      |   6 +
 .../test-crs-dynamic-auto-rebalance3/README   |   3 +
 .../test-crs-dynamic-auto-rebalance3/cmdlist  |  24 ++
 .../datacenter.cfg                            |   8 +
 .../dynamic_service_stats                     |   9 +
 .../hardware_status                           |   5 +
 .../log.expect                                |  88 +++++++
 .../manager_status                            |   1 +
 .../service_config                            |   9 +
 .../static_service_stats                      |   9 +
 .../hardware_status                           |   6 +-
 .../hardware_status                           |   6 +-
 .../hardware_status                           |  10 +-
 .../hardware_status                           |   6 +-
 .../static_service_stats                      |  52 +++-
 .../hardware_status                           |   6 +-
 .../static_service_stats                      |   9 +-
 src/test/test-crs-static1/hardware_status     |   6 +-
 src/test/test-crs-static2/hardware_status     |  10 +-
 src/test/test-crs-static3/hardware_status     |   6 +-
 src/test/test-crs-static4/hardware_status     |   6 +-
 src/test/test-crs-static5/hardware_status     |   6 +-
 66 files changed, 1290 insertions(+), 195 deletions(-)
 create mode 100644 src/PVE/HA/Usage/Dynamic.pm
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance0/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance1/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance2/static_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/README
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/cmdlist
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/datacenter.cfg
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/dynamic_service_stats
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/hardware_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/log.expect
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/manager_status
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/service_config
 create mode 100644 src/test/test-crs-dynamic-auto-rebalance3/static_service_stats

manager:

Daniel Kral (2):
  ui: dc/options: add dynamic load scheduler option
  ui: dc/options: add auto rebalancing options

 www/manager6/dc/OptionView.js | 46 +++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

base-commit: 71482d1833ded40a25a78b67f09cc1975acf92c9
prerequisite-patch-id: 25cc6a017d5278d73a77510dfa90379cef4d66b1

Summary over all repositories:
  79 files changed, 2666 insertions(+), 479 deletions(-)

-- 
Generated by murpp 0.9.0