From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting
Date: Tue, 30 Sep 2025 16:19:07 +0200
Message-ID: <20250930142021.366529-1-d.kral@proxmox.com>

An initial RFC that makes updates to the HA $online_node_usage object
more granular.

The current implementation rebuilds the $online_node_usage object quite
often. The most significant event causing a rebuild is changing a HA
resource's state.
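
To illustrate the rebuild pattern, here is a heavily simplified sketch
(not the actual PVE::HA::Manager::recompute_online_node_usage(...) code;
the sub name and parameters are placeholders): every rebuild walks all
services and re-adds their usage from scratch.

    use PVE::HA::Usage::Static;

    # Heavily simplified sketch of a full rebuild (placeholder sub; the
    # real code also handles the 'basic' scheduler and more states).
    sub rebuild_online_node_usage_sketch {
        my ($haenv, $online_nodes, $services) = @_;

        my $usage = PVE::HA::Usage::Static->new($haenv);
        $usage->add_node($_) for @$online_nodes;

        for my $sid (sort keys %$services) {
            my $sd = $services->{$sid};
            next if !$sd->{node};
            # with the static scheduler, adding usage ends up loading and
            # parsing the guest config to get cores / cpulimit / memory
            $usage->add_service_usage_to_node($sd->{node}, $sid, $sd->{node});
        }

        return $usage;
    }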

One worst-case scenario I tested was a 3-node cluster using the static
load scheduler with rebalance_on_request_start set and a varying number
of homogeneous HA resources distributed equally across the nodes at
startup. When these are all configured to start and rebalance at the
same time, 2/3 of them will move to different nodes in a round-robin
fashion: in the next manage(...) call, each HA resource is added to the
Manager status' services in state 'request_start'; 1/3 of them then
change to state 'started' while 2/3 change to state
'request_start_balance' (each calling select_service_node(...) at least
once to make that decision).
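
For reference, that decision roughly boils down to the following
(heavily simplified sketch, not the actual next_state_request_start
code; the real select_service_node(...) call takes many more arguments
and the sub below is just a placeholder):

    # Simplified sketch of the rebalance-on-start decision; $selected_node
    # is assumed to be the best-scored node from select_service_node(...).
    sub decide_request_start_sketch {
        my ($sd, $selected_node) = @_;

        if ($selected_node eq $sd->{node}) {
            $sd->{state} = 'started';
        } else {
            # balance to the better-scored node before actually starting
            $sd->{state} = 'request_start_balance';
            $sd->{target} = $selected_node;
        }
    }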

The biggest bottleneck here is the number of guest configurations that
need to be read and parsed to retrieve the necessary static information
on every rebuild of $online_node_usage, which grows with each HA
resource being handled. This is what the following patches address.
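
The gist of ha-manager patch #1 is to avoid that by fetching the
relevant properties for all guests in one go and caching them per
manage(...) round. A minimal sketch of the idea, assuming
get_guest_config_properties(...) returns a vmid => { property => value }
mapping (the property list, defaults and helper names below are
illustrative, not the patch's actual code):

    use PVE::Cluster;

    my $static_stats_cache;

    # illustrative defaults; the real values come from the guest config
    # schema defaults
    sub get_static_stats_cached {
        my ($vmid) = @_;

        $static_stats_cache //= PVE::Cluster::get_guest_config_properties(
            ['cores', 'cpulimit', 'memory']);

        my $props = $static_stats_cache->{$vmid} // {};

        return {
            maxcpu => $props->{cpulimit} || ($props->{cores} // 1),
            maxmem => ($props->{memory} // 512) * 1024 * 1024,
        };
    }

    # called once per manage(...) round so config changes are picked up
    sub invalidate_static_stats_cache { $static_stats_cache = undef; }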


= Patches =

qemu-server patch #1   fetches default values only when needed
proxmox patch #1       necessary for the pve-rs patch
pve-rs patch #1        allows removing service usage from nodes
ha-manager patch #1    implements a static stats cache and uses
                       PVE::Cluster::get_guest_config_properties(...)
ha-manager patch #2-#4 remove redundant $online_node_usage updates
ha-manager patch #5-#6 some decoupling and refactoring
      --- next patches need proxmox #1 and proxmox-perl-rs #1 ---
ha-manager patch #7-#8 set up $online_node_usage only once per round and
                       make granular changes in between (sketched below)
ha-manager patch #9    sets up $online_node_usage only when the scheduler
                       mode has changed
                       (will not acknowledge changes to static stats for
                       any HA resource that is already running and hasn't
                       changed its state yet, see TODO for more)
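
To give an idea of what "granular" means for patches #7-#8: with a
removal counterpart to the existing add_service_usage_to_node(...), a
state change only has to move a single service's usage instead of
triggering a full rebuild. A rough sketch (the method name on the
removal side is an assumption based on the patch subjects, not
necessarily the final API):

    # Rough sketch of a granular usage update on a service state change;
    # remove_service_usage_from_node() is assumed to mirror the existing
    # add_service_usage_to_node().
    sub move_service_usage_sketch {
        my ($online_node_usage, $sid, $old_node, $new_node) = @_;

        $online_node_usage->remove_service_usage_from_node($old_node, $sid)
            if defined($old_node);
        $online_node_usage->add_service_usage_to_node($new_node, $sid, $new_node)
            if defined($new_node);
    }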


= Benchmarks =

Here are some rudimentary benchmarks with the above setup (3-node
cluster, static load scheduler, rebalance_on_request_start set) in a
virtualized environment. The columns are HA resource counts and the rows
are the different sets of patches applied (qm #1 = qemu-server patch #1).


Run-times for the first manage(...) call to rebalance HA resources:

                300             3,000           9,000

master          27.2 s          -               -
#1              33.0 s          -               -
#8              605 ms          5.45 s          11.7 s
#8 + qm #1      248 ms          2.62 s          4.07 s


Average and total run-times for
PVE::HA::Usage::Static::score_nodes_to_start_service(...):

                300             3,000           9,000

master          2.49 ms (0.7 s) -               -
#1              454 µs (136 ms) -               -
#8              303 µs (91 ms)  367 µs (1.1 s)  429 µs (3.87 s)
#8 + qm #1      102 µs (30 ms)  102 µs (305 ms) 110 µs (991 ms)


I haven't included #9 here because it doesn't acknowledge hotplug guest
changes and only improves the amortized runtime, not the single-run
benchmark above, where #9 has to do roughly the same work as #8.

I ran `sync && echo 3 > /proc/sys/vm/drop_caches` before each of the
runs above to write out dirty pages and clear the slab + page cache, but
I haven't looked closely into whether this has any effect on memdb /
fuse here. Either way, the benchmarks aren't that isolated here and
should still be taken with a big pile of salt.


= TODO =

- patch #9 doesn't acknowledge any hotplug changes to cores / cpulimit /
  memory yet; the ha-manager would need some external notification for
  that (inotify or an 'invalidate_cache' / 'service $sid update' crm
  command?), but maybe that's too much effort and wouldn't bring any
  significant improvement. Especially with an upcoming 'dynamic' mode in
  mind, these stats updates would become way more common than with static
  resource changes, so #8 + qm #1 might be enough for now.

- some more performance improvements

  there is still some room to improve the performance here and there,
  e.g. buffering syslog writes, caching the parsed + default values for
  the guests, or making {parse,get_current}_memory faster, but I wanted
  to get some reviews in first; also, these benchmarks are on master
  without the HA rules performance improvements from another series

- a more automated benchmark infrastructure

  I have added some (simulated) benchmark scripts to my local tree, which
  I want to include in a future revision once they are more polished, so
  we can keep a closer eye on the runtime when changes are introduced; in
  the future it would also be great to have some automated benchmarks /
  test cases that can be run on actual clusters without the hassle of
  setting them up by hand


= Future ideas =

- Acknowledge node CPU + memory hotplug?

- When rebalancing HA resources on startup, it might be better to make
  the decision for multiple HA resources at the same time, so that with
  homogeneous nodes and services they aren't rebalanced in a round-robin
  fashion as above, even though that example might be exaggerated


qemu-server:

Daniel Kral (1):
  config: only fetch necessary default values in get_derived_property
    helper

 src/PVE/QemuConfig.pm | 8 +++-----
 src/PVE/QemuServer.pm | 6 ++++++
 2 files changed, 9 insertions(+), 5 deletions(-)


proxmox:

Daniel Kral (1):
  resource-scheduling: change score_nodes_to_start_service signature

 proxmox-resource-scheduling/src/pve_static.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


perl-rs:

Daniel Kral (1):
  pve-rs: resource_scheduling: allow granular usage changes

 .../bindings/resource_scheduling_static.rs    | 98 ++++++++++++++++---
 pve-rs/test/resource_scheduling.pl            | 84 ++++++++++++++--
 2 files changed, 160 insertions(+), 22 deletions(-)


ha-manager:

Daniel Kral (9):
  implement static service stats cache
  manager: remove redundant recompute_online_node_usage from
    next_state_recovery
  manager: remove redundant add_service_usage_to_node from
    next_state_recovery
  manager: remove redundant add_service_usage_to_node from
    next_state_started
  rules: resource affinity: decouple get_resource_affinity helper from
    Usage class
  manager: make recompute_online_node_usage use get_service_nodes helper
  usage: allow granular changes to Usage implementations
  manager: make online node usage computation granular
  manager: make service node usage computation more granular

 src/PVE/HA/Env.pm                    |  12 ++
 src/PVE/HA/Env/PVE2.pm               |  21 ++++
 src/PVE/HA/Manager.pm                | 167 ++++++++++++---------------
 src/PVE/HA/Resources/PVECT.pm        |   3 +-
 src/PVE/HA/Resources/PVEVM.pm        |   3 +-
 src/PVE/HA/Rules/ResourceAffinity.pm |  15 ++-
 src/PVE/HA/Sim/Env.pm                |  12 ++
 src/PVE/HA/Sim/Hardware.pm           |  31 +++--
 src/PVE/HA/Sim/Resources.pm          |   4 +-
 src/PVE/HA/Tools.pm                  |  29 +++++
 src/PVE/HA/Usage.pm                  |  24 +---
 src/PVE/HA/Usage/Basic.pm            |  33 ++----
 src/PVE/HA/Usage/Static.pm           |  45 ++++----
 src/test/test_failover1.pl           |  17 +--
 14 files changed, 233 insertions(+), 183 deletions(-)


Summary over all repositories:
  19 files changed, 403 insertions(+), 211 deletions(-)

-- 
Generated by git-murpp 0.8.0


