* [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting
@ 2025-09-30 14:19 Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
An initial RFC for making the changes to the HA $online_node_usage
object more granular.
The current implementation rebuilds the $online_node_usage object quite
often. The most significant event causing a rebuild is changing a HA
resource's state.
One worst-case scenario I tested was a 3-node cluster using the static
load scheduler with rebalance_on_request_start set and a varying number
of homogeneous HA resources evenly distributed across the nodes at
startup. When these are all configured to start and rebalance at the
same time, 2/3 of them will move to different nodes in a round-robin
fashion: each HA resource is added to the Manager status' services in
state 'request_start' in the next manage(...) call, then 1/3 will change
to state 'started' while 2/3 will change to state 'request_start_balance'
(all calling select_service_node(...) at least once to make that
decision).
The biggest bottleneck here is the number of guest configurations that
need to be read and parsed to retrieve the necessary static information
for each rebuild of $online_node_usage, which grows with every HA
resource being handled; this is what the following patches address.
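To make the cost pattern concrete, here is a tiny self-contained sketch
(illustration only, not actual ha-manager code; the names are mere
stand-ins) of why the rebuilds add up:

use strict;
use warnings;

my @services = map { "vm:$_" } (100 .. 109);
my $config_reads = 0;

# stands in for recompute_online_node_usage(): every rebuild re-reads
# the static stats of every service
sub rebuild_online_node_usage {
    $config_reads += scalar(@services);
}

# stands in for the state transitions of a rebalance round, each of
# which currently triggers a full rebuild via change_service_state()
rebuild_online_node_usage() for @services;

print "services: " . scalar(@services) . ", config reads: $config_reads\n";
# => services: 10, config reads: 100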
= Patches =
qemu-server patch #1    fetches default values only when needed
proxmox patch #1        necessary for the pve-rs patch
pve-rs patch #1         allow removing service usage from nodes
ha-manager patch #1     implement static cache and use
                        PVE::Cluster::get_guest_config_properties(...)
ha-manager patch #2-#4  remove redundant $online_node_usage updates
ha-manager patch #5-#6  some decoupling and refactoring
--- next patches need proxmox #1 and proxmox-perl-rs #1 ---
ha-manager patch #7-#8  set up $online_node_usage only once per round and
                        make granular changes in between
ha-manager patch #9     set up $online_node_usage only when the scheduler
                        mode has been changed
                        (will not acknowledge changes to static stats for
                        any HA resource which is already running and
                        hasn't changed its state yet, see TODO for more)
= Benchmarks =
Here are some rudimentary benchmarks with the above setup (3-node
cluster, static load scheduler, rebalance_on_request_start set) in a
virtualized environment. The columns are the HA resource counts and the
rows are the different patches applied (qm #1 = qemu-server patch #1).
Run-times for the first manage(...) call to rebalance HA resources:
            300       3,000     9,000
master      27.2 s    -         -
#1          33.0 s    -         -
#8          605 ms    5.45 s    11.7 s
#8 + qm #1  248 ms    2.62 s    4.07 s
Average and total run-times for
PVE::HA::Usage::Static::score_nodes_to_start_service(...):
            300               3,000             9,000
master      2.49 ms (0.7 s)   -                 -
#1          454 µs (136 ms)   -                 -
#8          303 µs (91 ms)    367 µs (1.1 s)    429 µs (3.87 s)
#8 + qm #1  102 µs (30 ms)    102 µs (305 ms)   110 µs (991 ms)
I haven't included #9 here, because it doesn't acknowledge hotplug guest
changes and improves the amortized runtime, not the single-run benchmark
from above, where #9 needs to do roughly the same work as #8.
I ran `sync && echo 3 > /proc/sys/vm/drop_caches` before each run to
write out dirty pages and clear the slab + page cache, but I haven't
looked too closely into whether memdb / fuse have any effect here.
Either way, the benchmarks aren't that isolated here anyway and should
still be taken with a big pile of salt.
= TODO =
- patch #9 doesn't acknowledge any hotplug changes to cores / cpulimit /
memory yet; the ha-manager would need some external notification for
that (inotify or an 'invalidate_cache' / 'service $sid update' crm
command?), but maybe that's too much and wouldn't bring any significant
improvements.. Especially when thinking about an upcoming 'dynamic'
mode, these stats updates would become way more common than with static
resource changes, so #8 + qm #1 might be enough for now..
- some more performance improvements
there is still some room and potential to improve the performance here
and there, e.g. buffering syslog writes, caching the parsed + default
values for the guests, or making {parse,get_current}_memory faster, but
I wanted to get some reviews in first; also, these benchmarks are on
master without the HA rules performance improvements from another
series
- a more automated benchmark infrastructure
I have added some (simulated) benchmark scripts to my local tree, which
I want to add in a future revision once they are more polished, so we
can keep a better eye on the runtime when changes are introduced; in
the future it would also be great to have some automated benchmarks /
test cases that can be run on actual clusters without the hassle of
setting them up by hand
= Future ideas =
- Acknowledge node CPU + memory hotplug?
- When rebalancing HA resources on startup, it might be better to make
the decision for multiple HA resources at the same time, so that for
homogeneous nodes and services they aren't rebalanced in a round-robin
fashion as above, even though that example might be exaggerated
qemu-server:
Daniel Kral (1):
config: only fetch necessary default values in get_derived_property
helper
src/PVE/QemuConfig.pm | 8 +++-----
src/PVE/QemuServer.pm | 6 ++++++
2 files changed, 9 insertions(+), 5 deletions(-)
proxmox:
Daniel Kral (1):
resource-scheduling: change score_nodes_to_start_service signature
proxmox-resource-scheduling/src/pve_static.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
perl-rs:
Daniel Kral (1):
pve-rs: resource_scheduling: allow granular usage changes
.../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
2 files changed, 160 insertions(+), 22 deletions(-)
ha-manager:
Daniel Kral (9):
implement static service stats cache
manager: remove redundant recompute_online_node_usage from
next_state_recovery
manager: remove redundant add_service_usage_to_node from
next_state_recovery
manager: remove redundant add_service_usage_to_node from
next_state_started
rules: resource affinity: decouple get_resource_affinity helper from
Usage class
manager: make recompute_online_node_usage use get_service_nodes helper
usage: allow granular changes to Usage implementations
manager: make online node usage computation granular
manager: make service node usage computation more granular
src/PVE/HA/Env.pm | 12 ++
src/PVE/HA/Env/PVE2.pm | 21 ++++
src/PVE/HA/Manager.pm | 167 ++++++++++++---------------
src/PVE/HA/Resources/PVECT.pm | 3 +-
src/PVE/HA/Resources/PVEVM.pm | 3 +-
src/PVE/HA/Rules/ResourceAffinity.pm | 15 ++-
src/PVE/HA/Sim/Env.pm | 12 ++
src/PVE/HA/Sim/Hardware.pm | 31 +++--
src/PVE/HA/Sim/Resources.pm | 4 +-
src/PVE/HA/Tools.pm | 29 +++++
src/PVE/HA/Usage.pm | 24 +---
src/PVE/HA/Usage/Basic.pm | 33 ++----
src/PVE/HA/Usage/Static.pm | 45 ++++----
src/test/test_failover1.pl | 17 +--
14 files changed, 233 insertions(+), 183 deletions(-)
Summary over all repositories:
19 files changed, 403 insertions(+), 211 deletions(-)
--
Generated by git-murpp 0.8.0
* [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature Daniel Kral
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
get_derived_property(...) is called in the semi-hot path of the HA
Manager's static load scheduler to retrieve the static stats of each VM.
As the defaults are only needed in certain cases and for a very small
subset of properties in the VM config, get those separately when needed.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
get_current_memory(...) is still quite costly here, because it calls
parse_memory(...), which calls
PVE::JSONSchema::parse_property_string(...), which adds up for many
guest configurations parsed in every manage(...) call, but this already
helps quite a lot.
src/PVE/QemuConfig.pm | 8 +++-----
src/PVE/QemuServer.pm | 6 ++++++
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
index d0844c4c..078c87e0 100644
--- a/src/PVE/QemuConfig.pm
+++ b/src/PVE/QemuConfig.pm
@@ -582,12 +582,10 @@ sub load_current_config {
sub get_derived_property {
my ($class, $conf, $name) = @_;
- my $defaults = PVE::QemuServer::load_defaults();
-
if ($name eq 'max-cpu') {
- my $cpus =
- ($conf->{sockets} || $defaults->{sockets}) * ($conf->{cores} || $defaults->{cores});
- return $conf->{vcpus} || $cpus;
+ my $sockets = $conf->{sockets} || PVE::QemuServer::get_default_property_value('sockets');
+ my $cores = $conf->{cores} || PVE::QemuServer::get_default_property_value('cores');
+ return $conf->{vcpus} || ($sockets * $cores);
} elsif ($name eq 'max-memory') { # current usage maximum, not maximum hotpluggable
return get_current_memory($conf->{memory}) * 1024 * 1024;
} else {
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 7d5ab718..459ba8f0 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2314,6 +2314,12 @@ sub write_vm_config {
return $raw;
}
+sub get_default_property_value {
+ my ($name) = @_;
+
+ return $confdesc->{$name}->{default};
+}
+
sub load_defaults {
my $res = {};
--
2.47.3
* [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes Daniel Kral
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
This is needed as the StaticNodeUsage values are now created in each
invocation of PVE::RS::ResourceScheduling::Static::score_nodes_to_start_service.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
Since it has only one caller, I haven't bothered to make this take
either a borrow or a reference..
proxmox-resource-scheduling/src/pve_static.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/proxmox-resource-scheduling/src/pve_static.rs b/proxmox-resource-scheduling/src/pve_static.rs
index d39614cd..fc40cb5c 100644
--- a/proxmox-resource-scheduling/src/pve_static.rs
+++ b/proxmox-resource-scheduling/src/pve_static.rs
@@ -70,7 +70,7 @@ criteria_struct! {
/// Returns a vector of (nodename, score) pairs. Scores are between 0.0 and 1.0 and a higher score
/// is better.
pub fn score_nodes_to_start_service(
- nodes: &[&StaticNodeUsage],
+ nodes: &[StaticNodeUsage],
service: &StaticServiceUsage,
) -> Result<Vec<(String, f64)>, Error> {
let len = nodes.len();
--
2.47.3
* [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache Daniel Kral
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
Implements a simple bidirectional map to track which service usages have
been added to nodes, so that these can be removed later individually.
The StaticServiceUsage is added to the HashMap<> in StaticNodeInfo to
reduce unnecessary indirection when summing these values in
score_nodes_to_start_service(...).
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I started out adding and removing the service_usage in StaticNodeUsage
on each add_service_usage_to_node(...) and remove_service_usage(...)
call individually, but I think summing those up every time is better for
numerical stability..
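For reference, this is roughly how the changed bindings are used from
the Perl side (mirrors the test adjustments below; the sid strings are
arbitrary):

use PVE::RS::ResourceScheduling::Static;

my $static = PVE::RS::ResourceScheduling::Static->new();
$static->add_node("A", 10, 100_000_000_000); # nodename, maxcpu, maxmem

my $service = { maxcpu => 4, maxmem => 20_000_000_000 };

# add_service_usage_to_node() now also takes the sid ...
$static->add_service_usage_to_node("A", "vm:101", $service);

# ... so a single HA resource's usage can be dropped again later on
$static->remove_service_usage("vm:101");

my $score_list = $static->score_nodes_to_start_service($service);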
.../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
2 files changed, 160 insertions(+), 22 deletions(-)
diff --git a/pve-rs/src/bindings/resource_scheduling_static.rs b/pve-rs/src/bindings/resource_scheduling_static.rs
index 7cf2f35..535dfe3 100644
--- a/pve-rs/src/bindings/resource_scheduling_static.rs
+++ b/pve-rs/src/bindings/resource_scheduling_static.rs
@@ -16,8 +16,16 @@ pub mod pve_rs_resource_scheduling_static {
perlmod::declare_magic!(Box<Scheduler> : &Scheduler as "PVE::RS::ResourceScheduling::Static");
+ struct StaticNodeInfo {
+ name: String,
+ maxcpu: usize,
+ maxmem: usize,
+ services: HashMap<String, StaticServiceUsage>,
+ }
+
struct Usage {
- nodes: HashMap<String, StaticNodeUsage>,
+ nodes: HashMap<String, StaticNodeInfo>,
+ service_nodes: HashMap<String, Vec<String>>,
}
/// A scheduler instance contains the resource usage by node.
@@ -30,6 +38,7 @@ pub mod pve_rs_resource_scheduling_static {
pub fn new(#[raw] class: Value) -> Result<Value, Error> {
let inner = Usage {
nodes: HashMap::new(),
+ service_nodes: HashMap::new(),
};
Ok(perlmod::instantiate_magic!(
@@ -39,7 +48,7 @@ pub mod pve_rs_resource_scheduling_static {
/// Method: Add a node with its basic CPU and memory info.
///
- /// This inserts a [`StaticNodeUsage`] entry for the node into the scheduler instance.
+ /// This inserts a [`StaticNodeInfo`] entry for the node into the scheduler instance.
#[export]
pub fn add_node(
#[try_from_ref] this: &Scheduler,
@@ -53,12 +62,11 @@ pub mod pve_rs_resource_scheduling_static {
bail!("node {} already added", nodename);
}
- let node = StaticNodeUsage {
+ let node = StaticNodeInfo {
name: nodename.clone(),
- cpu: 0.0,
maxcpu,
- mem: 0,
maxmem,
+ services: HashMap::new(),
};
usage.nodes.insert(nodename, node);
@@ -67,10 +75,25 @@ pub mod pve_rs_resource_scheduling_static {
/// Method: Remove a node from the scheduler.
#[export]
- pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) {
+ pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) -> Result<(), Error> {
let mut usage = this.inner.lock().unwrap();
- usage.nodes.remove(nodename);
+ if let Some(node) = usage.nodes.remove(nodename) {
+ for (sid, _) in node.services.iter() {
+ match usage.service_nodes.get_mut(sid) {
+ Some(nodes) => {
+ nodes.retain(|node| !node.eq(nodename));
+ }
+ None => bail!(
+ "service '{}' not present while removing node '{}'",
+ sid,
+ nodename
+ ),
+ }
+ }
+ }
+
+ Ok(())
}
/// Method: Get a list of all the nodes in the scheduler.
@@ -93,22 +116,55 @@ pub mod pve_rs_resource_scheduling_static {
usage.nodes.contains_key(nodename)
}
- /// Method: Add usage of `service` to the node's usage.
+ /// Method: Add service `sid` and its `service_usage` to the node.
#[export]
pub fn add_service_usage_to_node(
#[try_from_ref] this: &Scheduler,
nodename: &str,
- service: StaticServiceUsage,
+ sid: &str,
+ service_usage: StaticServiceUsage,
) -> Result<(), Error> {
let mut usage = this.inner.lock().unwrap();
match usage.nodes.get_mut(nodename) {
Some(node) => {
- node.add_service_usage(&service);
- Ok(())
+ if node.services.contains_key(sid) {
+ bail!("service '{}' already added to node '{}'", sid, nodename);
+ }
+
+ node.services.insert(sid.to_string(), service_usage);
}
None => bail!("node '{}' not present in usage hashmap", nodename),
}
+
+ if let Some(nodes) = usage.service_nodes.get_mut(sid) {
+ nodes.push(nodename.to_string());
+ } else {
+ usage
+ .service_nodes
+ .insert(sid.to_string(), vec![nodename.to_string()]);
+ }
+
+ Ok(())
+ }
+
+ /// Method: Remove service `sid` and its usage from all assigned nodes.
+ #[export]
+ fn remove_service_usage(#[try_from_ref] this: &Scheduler, sid: &str) -> Result<(), Error> {
+ let mut usage = this.inner.lock().unwrap();
+
+ if let Some(nodes) = usage.service_nodes.remove(sid) {
+ for nodename in &nodes {
+ match usage.nodes.get_mut(nodename) {
+ Some(node) => {
+ node.services.remove(sid);
+ }
+ None => bail!("service '{}' not present on node '{}'", sid, nodename),
+ }
+ }
+ }
+
+ Ok(())
}
/// Scores all previously added nodes for starting a `service` on.
@@ -126,7 +182,25 @@ pub mod pve_rs_resource_scheduling_static {
service: StaticServiceUsage,
) -> Result<Vec<(String, f64)>, Error> {
let usage = this.inner.lock().unwrap();
- let nodes = usage.nodes.values().collect::<Vec<&StaticNodeUsage>>();
+ let nodes = usage
+ .nodes
+ .values()
+ .map(|node| {
+ let mut node_usage = StaticNodeUsage {
+ name: node.name.to_string(),
+ cpu: 0.0,
+ maxcpu: node.maxcpu,
+ mem: 0,
+ maxmem: node.maxmem,
+ };
+
+ for service in node.services.values() {
+ node_usage.add_service_usage(service);
+ }
+
+ node_usage
+ })
+ .collect::<Vec<StaticNodeUsage>>();
proxmox_resource_scheduling::pve_static::score_nodes_to_start_service(&nodes, &service)
}
diff --git a/pve-rs/test/resource_scheduling.pl b/pve-rs/test/resource_scheduling.pl
index e3b7d2e..318238e 100755
--- a/pve-rs/test/resource_scheduling.pl
+++ b/pve-rs/test/resource_scheduling.pl
@@ -7,6 +7,20 @@ use Test::More;
use PVE::RS::ResourceScheduling::Static;
+my sub score_nodes {
+ my ($static, $service) = @_;
+
+ my $score_list = $static->score_nodes_to_start_service($service);
+
+ # imitate HA manager
+ my $scores = { map { $_->[0] => -$_->[1] } $score_list->@* };
+ my @nodes = sort {
+ $scores->{$a} <=> $scores->{$b} || $a cmp $b
+ } keys $scores->%*;
+
+ return @nodes;
+}
+
sub test_basic {
my $static = PVE::RS::ResourceScheduling::Static->new();
is(scalar($static->list_nodes()->@*), 0, 'node list empty');
@@ -33,6 +47,7 @@ sub test_balance {
maxmem => 20_000_000_000,
};
+ my @sid_names = ("a" .. "z");
for (my $i = 0; $i < 15; $i++) {
my $score_list = $static->score_nodes_to_start_service($service);
@@ -50,7 +65,54 @@ sub test_balance {
is($nodes[1], "A", 'second should be A');
}
- $static->add_service_usage_to_node($nodes[0], $service);
+ $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
+ }
+}
+
+sub test_balance_removal {
+ my $static = PVE::RS::ResourceScheduling::Static->new();
+ $static->add_node("A", 10, 100_000_000_000);
+ $static->add_node("B", 20, 200_000_000_000);
+ $static->add_node("C", 30, 300_000_000_000);
+
+ my $service = {
+ maxcpu => 4,
+ maxmem => 20_000_000_000,
+ };
+
+ $static->add_service_usage_to_node("A", "a", $service);
+ $static->add_service_usage_to_node("A", "b", $service);
+ $static->add_service_usage_to_node("B", "c", $service);
+ $static->add_service_usage_to_node("B", "d", $service);
+ $static->add_service_usage_to_node("C", "c", $service);
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "C");
+ is($nodes[1], "B");
+ is($nodes[2], "A");
+ }
+
+ $static->remove_service_usage("d");
+ $static->remove_service_usage("c");
+ $static->add_service_usage_to_node("C", "c", $service);
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "B");
+ is($nodes[1], "C");
+ is($nodes[2], "A");
+ }
+
+ $static->remove_node("B");
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "C");
+ is($nodes[1], "A");
}
}
@@ -66,11 +128,11 @@ sub test_overcommitted {
maxmem => 536_870_912,
};
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("B", $service);
- $static->add_service_usage_to_node("A", $service);
+ $static->add_service_usage_to_node("A", "a", $service);
+ $static->add_service_usage_to_node("A", "b", $service);
+ $static->add_service_usage_to_node("A", "c", $service);
+ $static->add_service_usage_to_node("B", "d", $service);
+ $static->add_service_usage_to_node("A", "e", $service);
my $score_list = $static->score_nodes_to_start_service($service);
@@ -96,9 +158,9 @@ sub test_balance_small_memory_difference {
$static->add_node("C", 4, 8_000_000_000);
if ($with_start_load) {
- $static->add_service_usage_to_node("A", { maxcpu => 4, maxmem => 1_000_000_000 });
- $static->add_service_usage_to_node("B", { maxcpu => 2, maxmem => 1_000_000_000 });
- $static->add_service_usage_to_node("C", { maxcpu => 2, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("A", "a", { maxcpu => 4, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("B", "b", { maxcpu => 2, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("C", "c", { maxcpu => 2, maxmem => 1_000_000_000 });
}
my $service = {
@@ -106,6 +168,7 @@ sub test_balance_small_memory_difference {
maxmem => 16_000_000,
};
+ my @sid_names = ("d" .. "z");
for (my $i = 0; $i < 20; $i++) {
my $score_list = $static->score_nodes_to_start_service($service);
@@ -131,12 +194,13 @@ sub test_balance_small_memory_difference {
die "internal error, got $i % 4 == " . ($i % 4) . "\n";
}
- $static->add_service_usage_to_node($nodes[0], $service);
+ $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
}
}
test_basic();
test_balance();
+test_balance_removal();
test_overcommitted();
test_balance_small_memory_difference(1);
test_balance_small_memory_difference(0);
--
2.47.3
* [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (2 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery Daniel Kral
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
When the HA Manager builds the static load scheduler, it queries the
services' static usage by reading and parsing the guest configs
individually, which can take significant time depending on how often
recompute_online_node_usage(...) is called.
PVE::Cluster exposes an efficient interface to gather a set of
properties from one or all guest configs [0]. This is used here to build
a rather short-lived cache on every (re)initialization of the static
load scheduler.
The downside of this approach shows when there are far more
non-HA-managed guests in the cluster than HA-managed ones, as this then
loads much more information than necessary. It also doesn't cache the
default values for the environment-specific static service usage, which
is quite a bottleneck as well.
[0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
Suggested-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
If the above-mentioned downside is too large for some setups, we could
also just add a switch to enable/disable the static stats cache or
automatically figure out whether it brings any benefit.. But I think
this will be good enough, especially with the later patches making far
fewer calls to get_service_usage(...).
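For clarity, the cache is expected to look roughly like this (a sketch
based on the lookup in the hunks below, assuming the result of
get_guest_config_properties(...) is keyed by VMID and only contains
properties that are actually set):

# $self->{static_service_stats} = {
#     100 => { cores => 4, memory => 8192, sockets => 1 },
#     101 => { memory => 2048 },
# };
#
# get_static_service_stats(100) then hands this partial config to
# get_derived_property(), which falls back to defaults for anything
# missing.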
src/PVE/HA/Env.pm | 12 ++++++++++++
src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
src/PVE/HA/Manager.pm | 1 +
src/PVE/HA/Resources/PVECT.pm | 3 ++-
src/PVE/HA/Resources/PVEVM.pm | 3 ++-
src/PVE/HA/Sim/Env.pm | 12 ++++++++++++
src/PVE/HA/Sim/Hardware.pm | 31 +++++++++++++++++++++----------
src/PVE/HA/Sim/Resources.pm | 4 ++--
8 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
index e00272a0..4282d33f 100644
--- a/src/PVE/HA/Env.pm
+++ b/src/PVE/HA/Env.pm
@@ -300,6 +300,18 @@ sub get_datacenter_settings {
return $self->{plug}->get_datacenter_settings();
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ return $self->{plug}->get_static_service_stats($id);
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ return $self->{plug}->update_static_service_stats();
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index e76e86b8..e1d752f7 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -49,6 +49,8 @@ sub new {
$self->{nodename} = $nodename;
+ $self->{static_service_stats} = undef;
+
return $self;
}
@@ -497,6 +499,25 @@ sub get_datacenter_settings {
};
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ # undef if update_static_service_stats(...) failed before
+ return undef if !defined($self->{static_service_stats});
+
+ return $self->{static_service_stats}->{$id} // {};
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ my $properties = ['cores', 'cpulimit', 'memory', 'sockets', 'vcpus'];
+ my $stats = eval { PVE::Cluster::get_guest_config_properties($properties) };
+ $self->log('warning', "unable to update static service stats cache - $@") if $@;
+
+ $self->{static_service_stats} = $stats;
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index ba59f642..3f81f233 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -247,6 +247,7 @@ sub recompute_online_node_usage {
$online_node_usage = eval {
my $scheduler = PVE::HA::Usage::Static->new($haenv);
$scheduler->add_node($_) for $online_nodes->@*;
+ $haenv->update_static_service_stats();
return $scheduler;
};
$haenv->log(
diff --git a/src/PVE/HA/Resources/PVECT.pm b/src/PVE/HA/Resources/PVECT.pm
index 44644d92..091249d7 100644
--- a/src/PVE/HA/Resources/PVECT.pm
+++ b/src/PVE/HA/Resources/PVECT.pm
@@ -156,7 +156,8 @@ sub remove_locks {
sub get_static_stats {
my ($class, $haenv, $id, $service_node) = @_;
- my $conf = PVE::LXC::Config->load_config($id, $service_node);
+ my $conf = $haenv->get_static_service_stats($id);
+ $conf = PVE::LXC::Config->load_config($id, $service_node) if !defined($conf);
return {
maxcpu => PVE::LXC::Config->get_derived_property($conf, 'max-cpu'),
diff --git a/src/PVE/HA/Resources/PVEVM.pm b/src/PVE/HA/Resources/PVEVM.pm
index e634fe3c..d1bc3329 100644
--- a/src/PVE/HA/Resources/PVEVM.pm
+++ b/src/PVE/HA/Resources/PVEVM.pm
@@ -177,7 +177,8 @@ sub remove_locks {
sub get_static_stats {
my ($class, $haenv, $id, $service_node) = @_;
- my $conf = PVE::QemuConfig->load_config($id, $service_node);
+ my $conf = $haenv->get_static_service_stats($id);
+ $conf = PVE::QemuConfig->load_config($id, $service_node) if !defined($conf);
return {
maxcpu => PVE::QemuConfig->get_derived_property($conf, 'max-cpu'),
diff --git a/src/PVE/HA/Sim/Env.pm b/src/PVE/HA/Sim/Env.pm
index 684e92f8..1d70026e 100644
--- a/src/PVE/HA/Sim/Env.pm
+++ b/src/PVE/HA/Sim/Env.pm
@@ -488,6 +488,18 @@ sub get_datacenter_settings {
};
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ return $self->{hardware}->get_static_service_stats($id);
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ return $self->{hardware}->update_static_service_stats();
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
index 9e8c7995..fae27b2a 100644
--- a/src/PVE/HA/Sim/Hardware.pm
+++ b/src/PVE/HA/Sim/Hardware.pm
@@ -387,16 +387,6 @@ sub write_service_status {
return $res;
}
-sub read_static_service_stats {
- my ($self) = @_;
-
- my $filename = "$self->{statusdir}/static_service_stats";
- my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
- $self->log('error', "loading static service stats failed - $@") if $@;
-
- return $stats;
-}
-
sub new {
my ($this, $testdir) = @_;
@@ -477,6 +467,8 @@ sub new {
$self->{service_config} = $self->read_service_config();
+ $self->{static_service_stats} = undef;
+
return $self;
}
@@ -943,6 +935,25 @@ sub watchdog_update {
return &$modify_watchog($self, $code);
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ # undef if update_static_service_stats(...) failed before
+ return undef if !defined($self->{static_service_stats});
+
+ return $self->{static_service_stats}->{$id} // {};
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ my $filename = "$self->{statusdir}/static_service_stats";
+ my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
+ $self->log('warning', "unable to update static service stats cache - $@") if $@;
+
+ $self->{static_service_stats} = $stats;
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
index 72623ee1..7641b1a9 100644
--- a/src/PVE/HA/Sim/Resources.pm
+++ b/src/PVE/HA/Sim/Resources.pm
@@ -143,8 +143,8 @@ sub get_static_stats {
my $sid = $class->type() . ":$id";
my $hardware = $haenv->hardware();
- my $stats = $hardware->read_static_service_stats();
- if (my $service_stats = $stats->{$sid}) {
+ my $service_stats = $hardware->get_static_service_stats($sid);
+ if (%$service_stats) {
return $service_stats;
} elsif ($id =~ /^(\d)(\d\d)/) {
# auto assign usage calculated from ID for convenience
--
2.47.3
* [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (3 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node " Daniel Kral
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
recompute_online_node_usage(...) is currently only dependent on the
configured CRS scheduler mode, the list of online nodes and the HA
resource's state, current node and optional migration target.
The proper recovery of HA resources was introduced in 9da84a0d ("fix
'change_service_location' misuse and recovery from fencing") as a
private helper to recover fenced HA resources, and
recompute_online_node_usage(...) was needed there.
As the recovery of fenced HA resources has been its own FSM state since
c259b1a8 ("manager: make recovery actual state in FSM"), the
change_service_state(...) transition to 'recovery' already calls
recompute_online_node_usage(...) beforehand, making this call redundant.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 3f81f233..f69a7fe9 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1301,8 +1301,6 @@ sub next_state_recovery {
my $fenced_node = $sd->{node}; # for logging purpose
- $self->recompute_online_node_usage(); # we want the most current node state
-
my $recovery_node = select_service_node(
$self->{rules},
$self->{online_node_usage},
--
2.47.3
* [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node from next_state_recovery
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (4 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started Daniel Kral
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
Since 5c2eef4b ("account service to source and target during move") a
moving HA resource's load is accounted for on the source and target
nodes.
A HA resource is in the 'recovery' state after its node has been
successfully fenced, and its primary goal becomes finding a recovery
node. As soon as a recovery node is found, the HA resource is moved
there. As the previous node was fenced, load is only put on the recovery
node.
The add_service_usage_to_node(...) call is redundant at this point, as
the subsequent change_service_state(...) to either the 'started' or
'request_stop' state will call recompute_online_node_usage(...), which
discards the changes to $online_node_usage immediately.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index f69a7fe9..6afc9250 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1321,8 +1321,6 @@ sub next_state_recovery {
$fence_recovery_cleanup->($self, $sid, $fenced_node);
$haenv->steal_service($sid, $sd->{node}, $recovery_node);
- $self->{online_node_usage}->add_service_usage_to_node($recovery_node, $sid, $recovery_node);
- $self->{online_node_usage}->add_service_node($sid, $recovery_node);
# NOTE: $sd *is normally read-only*, fencing is the exception
$cd->{node} = $sd->{node} = $recovery_node;
--
2.47.3
* [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (5 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node " Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class Daniel Kral
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
Since 5c2eef4b ("account service to source and target during move") a
moving HA resource's load is accounted for on the source and target
nodes.
A HA resource in the 'started' state, which is not configured otherwise
or has no pending CRM commands left to process, actively checks whether
there is a "better" node placement by querying select_service_node(...).
When a better node is found, the HA resource will be migrated or
relocated to that node, depending on its type.
The add_service_usage_to_node(...) call is redundant at this point, as
the subsequent change_service_state(...) to either the 'migrate' or
'relocate' state will call recompute_online_node_usage(...), which
discards the changes to $online_node_usage immediately.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 3 ---
1 file changed, 3 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 6afc9250..468e41eb 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1194,9 +1194,6 @@ sub next_state_started {
);
if ($node && ($sd->{node} ne $node)) {
- $self->{online_node_usage}->add_service_usage_to_node($node, $sid, $sd->{node});
- $self->{online_node_usage}->add_service_node($sid, $node);
-
if (defined(my $fallback = $sd->{maintenance_node})) {
if ($node eq $fallback) {
$haenv->log(
--
2.47.3
* [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (6 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper Daniel Kral
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
The resource affinity rules need information about the nodes used by
the other HA resources to be enacted correctly, which was previously
proxied through $online_node_usage.
The get_used_service_nodes(...) helper mirrors the same logic as in
recompute_online_node_usage(...) to retrieve the nodes a HA resource
$sid currently puts load on.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I wanted to put this information in PVE::HA::Usage directly, but figured
it would make Usage much more dependent on $ss / $sd.. We can still do
that later on or move the helper elsewhere, e.g. making ServiceStatus
its own class?
src/PVE/HA/Manager.pm | 24 +++++++++++------------
src/PVE/HA/Rules/ResourceAffinity.pm | 15 ++++++++------
src/PVE/HA/Tools.pm | 29 ++++++++++++++++++++++++++++
src/PVE/HA/Usage.pm | 18 -----------------
src/PVE/HA/Usage/Basic.pm | 19 ------------------
src/PVE/HA/Usage/Static.pm | 19 ------------------
src/test/test_failover1.pl | 17 +++++++++-------
7 files changed, 59 insertions(+), 82 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 468e41eb..0226e427 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -121,12 +121,12 @@ sub flush_master_status {
=head3 select_service_node(...)
-=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference)
+=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)
Used to select the best fitting node for the service C<$sid>, with the
-configuration C<$service_conf> and state C<$sd>, according to the rules defined
-in C<$rules>, available node utilization in C<$online_node_usage>, and the
-given C<$node_preference>.
+configuration C<$service_conf>, according to the rules defined in C<$rules>,
+available node utilization in C<$online_node_usage>, the service states in
+C<$ss> and the given C<$node_preference>.
The C<$node_preference> can be set to:
@@ -143,11 +143,12 @@ The C<$node_preference> can be set to:
=cut
sub select_service_node {
- my ($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference) = @_;
+ my ($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference) = @_;
die "'$node_preference' is not a valid node_preference for select_service_node\n"
if $node_preference !~ m/(none|best-score|try-next)/;
+ my $sd = $ss->{$sid};
my ($current_node, $tried_nodes, $maintenance_fallback) =
$sd->@{qw(node failed_nodes maintenance_node)};
@@ -155,7 +156,8 @@ sub select_service_node {
return undef if !%$pri_nodes;
- my ($together, $separate) = get_resource_affinity($rules, $sid, $online_node_usage);
+ my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
+ my ($together, $separate) = get_resource_affinity($rules, $sid, $ss, $online_nodes);
# stay on current node if possible (avoids random migrations)
if (
@@ -281,7 +283,6 @@ sub recompute_online_node_usage {
|| $state eq 'recovery'
) {
$online_node_usage->add_service_usage_to_node($sd->{node}, $sid, $sd->{node});
- $online_node_usage->set_service_node($sid, $sd->{node});
} elsif (
$state eq 'migrate'
|| $state eq 'relocate'
@@ -291,11 +292,9 @@ sub recompute_online_node_usage {
# count it for both, source and target as load is put on both
if ($state ne 'request_start_balance') {
$online_node_usage->add_service_usage_to_node($source, $sid, $source, $target);
- $online_node_usage->add_service_node($sid, $source);
}
if ($online_node_usage->contains_node($target)) {
$online_node_usage->add_service_usage_to_node($target, $sid, $source, $target);
- $online_node_usage->add_service_node($sid, $target);
}
} elsif ($state eq 'stopped' || $state eq 'request_start') {
# do nothing
@@ -308,7 +307,6 @@ sub recompute_online_node_usage {
# case a node dies, as we cannot really know if the to-be-aborted incoming migration
# has already cleaned up all used resources
$online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $target);
- $online_node_usage->set_service_node($sid, $target);
}
}
}
@@ -1022,7 +1020,7 @@ sub next_state_request_start {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
'best-score',
);
my $select_text = $selected_node ne $current_node ? 'new' : 'current';
@@ -1189,7 +1187,7 @@ sub next_state_started {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
$select_node_preference,
);
@@ -1303,7 +1301,7 @@ sub next_state_recovery {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
'best-score',
);
diff --git a/src/PVE/HA/Rules/ResourceAffinity.pm b/src/PVE/HA/Rules/ResourceAffinity.pm
index 9bc039ba..968aa088 100644
--- a/src/PVE/HA/Rules/ResourceAffinity.pm
+++ b/src/PVE/HA/Rules/ResourceAffinity.pm
@@ -5,6 +5,7 @@ use warnings;
use PVE::HA::HashTools qw(set_intersect sets_are_disjoint);
use PVE::HA::Rules;
+use PVE::HA::Tools;
use base qw(Exporter);
use base qw(PVE::HA::Rules);
@@ -496,12 +497,12 @@ sub get_affinitive_resources : prototype($$) {
return ($together, $separate);
}
-=head3 get_resource_affinity($rules, $sid, $online_node_usage)
+=head3 get_resource_affinity($rules, $sid, $ss, $online_nodes)
Returns a list of two hashes, where the first describes the positive resource
affinity and the second hash describes the negative resource affinity for
-resource C<$sid> according to the resource affinity rules in C<$rules> and the
-resource locations in C<$online_node_usage>.
+resource C<$sid> according to the resource affinity rules in C<$rules>, the
+service status C<$ss> and the C<$online_nodes> hash.
For the positive resource affinity of a resource C<$sid>, each element in the
hash represents an online node, where other resources, which C<$sid> is in
@@ -529,8 +530,8 @@ resource C<$sid> is in a negative affinity with, the returned value will be:
=cut
-sub get_resource_affinity : prototype($$$) {
- my ($rules, $sid, $online_node_usage) = @_;
+sub get_resource_affinity : prototype($$$$) {
+ my ($rules, $sid, $ss, $online_nodes) = @_;
my $together = {};
my $separate = {};
@@ -543,7 +544,9 @@ sub get_resource_affinity : prototype($$$) {
for my $csid (keys %{ $rule->{resources} }) {
next if $csid eq $sid;
- my $nodes = $online_node_usage->get_service_nodes($csid);
+ my $used_nodes =
+ PVE::HA::Tools::get_used_service_nodes($ss->{$csid}, $online_nodes);
+ my $nodes = [values %$used_nodes];
next if !$nodes || !@$nodes; # skip unassigned nodes
diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
index 71eb5d0b..7f718e25 100644
--- a/src/PVE/HA/Tools.pm
+++ b/src/PVE/HA/Tools.pm
@@ -188,6 +188,35 @@ sub count_fenced_services {
return $count;
}
+sub get_used_service_nodes {
+ my ($sd, $online_nodes) = @_;
+
+ my $result = {};
+
+ my ($state, $node, $target) = $sd->@{qw(state node target)};
+
+ return $result if $state eq 'stopped' || $state eq 'request_start';
+
+ if (
+ $state eq 'started'
+ || $state eq 'request_stop'
+ || $state eq 'fence'
+ || $state eq 'freeze'
+ || $state eq 'error'
+ || $state eq 'recovery'
+ || $state eq 'migrate'
+ || $state eq 'relocate'
+ ) {
+ $result->{current} = $node if $online_nodes->{$node};
+ }
+
+ if ($state eq 'migrate' || $state eq 'relocate' || $state eq 'request_start_balance') {
+ $result->{target} = $target if defined($target) && $online_nodes->{$target};
+ }
+
+ return $result;
+}
+
sub get_verbose_service_state {
my ($service_state, $service_conf) = @_;
diff --git a/src/PVE/HA/Usage.pm b/src/PVE/HA/Usage.pm
index 7f4d9ca3..66d9572e 100644
--- a/src/PVE/HA/Usage.pm
+++ b/src/PVE/HA/Usage.pm
@@ -27,24 +27,6 @@ sub list_nodes {
die "implement in subclass";
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- die "implement in subclass";
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- die "implement in subclass";
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- die "implement in subclass";
-}
-
sub contains_node {
my ($self, $nodename) = @_;
diff --git a/src/PVE/HA/Usage/Basic.pm b/src/PVE/HA/Usage/Basic.pm
index afe3733c..ead08c54 100644
--- a/src/PVE/HA/Usage/Basic.pm
+++ b/src/PVE/HA/Usage/Basic.pm
@@ -11,7 +11,6 @@ sub new {
return bless {
nodes => {},
haenv => $haenv,
- 'service-nodes' => {},
}, $class;
}
@@ -39,24 +38,6 @@ sub contains_node {
return defined($self->{nodes}->{$nodename});
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- return $self->{'service-nodes'}->{$sid};
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- $self->{'service-nodes'}->{$sid} = [$nodename];
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- push @{ $self->{'service-nodes'}->{$sid} }, $nodename;
-}
-
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
diff --git a/src/PVE/HA/Usage/Static.pm b/src/PVE/HA/Usage/Static.pm
index 6707a54c..061e74a2 100644
--- a/src/PVE/HA/Usage/Static.pm
+++ b/src/PVE/HA/Usage/Static.pm
@@ -22,7 +22,6 @@ sub new {
'service-stats' => {},
haenv => $haenv,
scheduler => $scheduler,
- 'service-nodes' => {},
'service-counts' => {}, # Service count on each node. Fallback if scoring calculation fails.
}, $class;
}
@@ -87,24 +86,6 @@ my sub get_service_usage {
return $service_stats;
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- return $self->{'service-nodes'}->{$sid};
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- $self->{'service-nodes'}->{$sid} = [$nodename];
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- push @{ $self->{'service-nodes'}->{$sid} }, $nodename;
-}
-
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
diff --git a/src/test/test_failover1.pl b/src/test/test_failover1.pl
index 78a001eb..495d4b4b 100755
--- a/src/test/test_failover1.pl
+++ b/src/test/test_failover1.pl
@@ -14,9 +14,10 @@ PVE::HA::Rules::NodeAffinity->register();
PVE::HA::Rules->init(property_isolation => 1);
+my $sid = 'vm:111';
my $rules = PVE::HA::Rules->parse_config("rules.tmp", <<EOD);
node-affinity: prefer_node1
- resources vm:111
+ resources $sid
nodes node1
EOD
@@ -31,10 +32,12 @@ my $service_conf = {
failback => 1,
};
-my $sd = {
- node => $service_conf->{node},
- failed_nodes => undef,
- maintenance_node => undef,
+my $ss = {
+ "$sid" => {
+ node => $service_conf->{node},
+ failed_nodes => undef,
+ maintenance_node => undef,
+ },
};
sub test {
@@ -43,14 +46,14 @@ sub test {
my $select_node_preference = $try_next ? 'try-next' : 'none';
my $node = PVE::HA::Manager::select_service_node(
- $rules, $online_node_usage, "vm:111", $service_conf, $sd, $select_node_preference,
+ $rules, $online_node_usage, "$sid", $service_conf, $ss, $select_node_preference,
);
my (undef, undef, $line) = caller();
die "unexpected result: $node != ${expected_node} at line $line\n"
if $node ne $expected_node;
- $sd->{node} = $node;
+ $ss->{$sid}->{node} = $node;
}
test('node1');
--
2.47.3
* [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (7 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations Daniel Kral
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
As the previously introduced PVE::HA::Tools::get_used_service_nodes(...)
helper reflects the same logic as in recompute_online_node_usage(...),
use the helper instead to reduce code duplication.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 53 +++++++++----------------------------------
1 file changed, 11 insertions(+), 42 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0226e427..d0d4d0a5 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -240,7 +240,7 @@ sub recompute_online_node_usage {
my $haenv = $self->{haenv};
- my $online_nodes = $self->{ns}->list_online_nodes();
+ my $online_nodes = { map { $_ => 1 } $self->{ns}->list_online_nodes()->@* };
my $online_node_usage;
@@ -248,7 +248,7 @@ sub recompute_online_node_usage {
if ($mode eq 'static') {
$online_node_usage = eval {
my $scheduler = PVE::HA::Usage::Static->new($haenv);
- $scheduler->add_node($_) for $online_nodes->@*;
+ $scheduler->add_node($_) for keys $online_nodes->%*;
$haenv->update_static_service_stats();
return $scheduler;
};
@@ -266,49 +266,18 @@ sub recompute_online_node_usage {
# fallback to the basic algorithm in any case
if (!$online_node_usage) {
$online_node_usage = PVE::HA::Usage::Basic->new($haenv);
- $online_node_usage->add_node($_) for $online_nodes->@*;
+ $online_node_usage->add_node($_) for keys $online_nodes->%*;
}
- foreach my $sid (sort keys %{ $self->{ss} }) {
+ for my $sid (sort keys $self->{ss}->%*) {
my $sd = $self->{ss}->{$sid};
- my $state = $sd->{state};
- my $target = $sd->{target}; # optional
- if ($online_node_usage->contains_node($sd->{node})) {
- if (
- $state eq 'started'
- || $state eq 'request_stop'
- || $state eq 'fence'
- || $state eq 'freeze'
- || $state eq 'error'
- || $state eq 'recovery'
- ) {
- $online_node_usage->add_service_usage_to_node($sd->{node}, $sid, $sd->{node});
- } elsif (
- $state eq 'migrate'
- || $state eq 'relocate'
- || $state eq 'request_start_balance'
- ) {
- my $source = $sd->{node};
- # count it for both, source and target as load is put on both
- if ($state ne 'request_start_balance') {
- $online_node_usage->add_service_usage_to_node($source, $sid, $source, $target);
- }
- if ($online_node_usage->contains_node($target)) {
- $online_node_usage->add_service_usage_to_node($target, $sid, $source, $target);
- }
- } elsif ($state eq 'stopped' || $state eq 'request_start') {
- # do nothing
- } else {
- die "should not be reached (sid = '$sid', state = '$state')";
- }
- } elsif (defined($target) && $online_node_usage->contains_node($target)) {
- if ($state eq 'migrate' || $state eq 'relocate') {
- # to correctly track maintenance modi and also consider the target as used for the
- # case a node dies, as we cannot really know if the to-be-aborted incoming migration
- # has already cleaned up all used resources
- $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $target);
- }
- }
+ my $used_nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
+ my ($current, $target) = $used_nodes->@{qw(current target)};
+
+ $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
+ if $current;
+ $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
+ if $target;
}
$self->{online_node_usage} = $online_node_usage;
--
2.47.3
* [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (8 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
11 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
This makes use of the new signature of add_service_usage_to_node(...)
from the PVE::RS::ResourceScheduling::Static package, which allows
tracking which HA resources have been assigned to which nodes.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Usage.pm | 6 ++++++
src/PVE/HA/Usage/Basic.pm | 14 +++++++++++---
src/PVE/HA/Usage/Static.pm | 26 ++++++++++++++++++++------
3 files changed, 37 insertions(+), 9 deletions(-)
diff --git a/src/PVE/HA/Usage.pm b/src/PVE/HA/Usage.pm
index 66d9572e..f19fae76 100644
--- a/src/PVE/HA/Usage.pm
+++ b/src/PVE/HA/Usage.pm
@@ -40,6 +40,12 @@ sub add_service_usage_to_node {
die "implement in subclass";
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ die "implement in subclass";
+}
+
# Returns a hash with $nodename => $score pairs. A lower $score is better.
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
diff --git a/src/PVE/HA/Usage/Basic.pm b/src/PVE/HA/Usage/Basic.pm
index ead08c54..9b80661a 100644
--- a/src/PVE/HA/Usage/Basic.pm
+++ b/src/PVE/HA/Usage/Basic.pm
@@ -17,7 +17,7 @@ sub new {
sub add_node {
my ($self, $nodename) = @_;
- $self->{nodes}->{$nodename} = 0;
+ $self->{nodes}->{$nodename} = {};
}
sub remove_node {
@@ -42,7 +42,7 @@ sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
if ($self->contains_node($nodename)) {
- $self->{nodes}->{$nodename}++;
+ $self->{nodes}->{$nodename}->{$sid} = 1;
} else {
$self->{haenv}->log(
'warning',
@@ -51,10 +51,18 @@ sub add_service_usage_to_node {
}
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ for my $node ($self->list_nodes()) {
+ delete $self->{nodes}->{$node}->{$sid};
+ }
+}
+
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
- return $self->{nodes};
+ return { map { $_ => scalar keys $self->{nodes}->{$_}->%* } keys $self->{nodes}->%* };
}
1;
diff --git a/src/PVE/HA/Usage/Static.pm b/src/PVE/HA/Usage/Static.pm
index 061e74a2..2867d4ff 100644
--- a/src/PVE/HA/Usage/Static.pm
+++ b/src/PVE/HA/Usage/Static.pm
@@ -22,14 +22,14 @@ sub new {
'service-stats' => {},
haenv => $haenv,
scheduler => $scheduler,
- 'service-counts' => {}, # Service count on each node. Fallback if scoring calculation fails.
+ 'node-services' => {}, # Services on each node. Fallback if scoring calculation fails.
}, $class;
}
sub add_node {
my ($self, $nodename) = @_;
- $self->{'service-counts'}->{$nodename} = 0;
+ $self->{'node-services'}->{$nodename} = {};
my $stats = $self->{'node-stats'}->{$nodename}
or die "did not get static node usage information for '$nodename'\n";
@@ -43,7 +43,7 @@ sub add_node {
sub remove_node {
my ($self, $nodename) = @_;
- delete $self->{'service-counts'}->{$nodename};
+ delete $self->{'node-services'}->{$nodename};
$self->{scheduler}->remove_node($nodename);
}
@@ -89,16 +89,27 @@ my sub get_service_usage {
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
- $self->{'service-counts'}->{$nodename}++;
+ $self->{'node-services'}->{$nodename}->{$sid} = 1;
eval {
my $service_usage = get_service_usage($self, $sid, $service_node, $migration_target);
- $self->{scheduler}->add_service_usage_to_node($nodename, $service_usage);
+ $self->{scheduler}->add_service_usage_to_node($nodename, $sid, $service_usage);
};
$self->{haenv}->log('warning', "unable to add service '$sid' usage to node '$nodename' - $@")
if $@;
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ delete $self->{'node-services'}->{$_}->{$sid} for $self->list_nodes();
+
+ eval { $self->{scheduler}->remove_service_usage($sid) };
+ $self->{haenv}->log('warning', "unable to remove service '$sid' usage - $@") if $@;
+
+ delete $self->{'service-stats'}->{$sid}; # Invalidate old service stats
+}
+
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
@@ -111,7 +122,10 @@ sub score_nodes_to_start_service {
'err',
"unable to score nodes according to static usage for service '$sid' - $err",
);
- return $self->{'service-counts'};
+ return {
+ map { $_ => scalar keys $self->{'node-services'}->{$_}->%* }
+ keys $self->{'node-services'}->%*
+ };
}
# Take minus the value, so that a lower score is better, which our caller(s) expect(s).
--
2.47.3
* [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
The HA Manager builds $online_node_usage in every FSM iteration in
manage(...) and additionally on every HA resource state change in
change_service_state(...). This becomes quite costly with a high HA
resource count and many state changes happening at once, e.g. when
multiple nodes start up with rebalance_on_request_start set, or when a
node with many configured HA resources fails over.
To improve this, make the changes to $online_node_usage more granular:
build it only once per call to manage(...) and update the nodes a HA
resource uses individually on every HA resource state transition.
The loss in service usage "freshness" should be negligible here, as the
static service usage data is cached anyway (except when the cache fails
for some reason).
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
The add_service_usage(...) helper is added in anticipation of the next
patch; we would not need a helper if we don't go for #9.
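To illustrate the intended effect (a rough sketch, not taken verbatim
from the diff): a state transition no longer rebuilds the whole usage
object, it only touches the resource that actually changed state.

    # before: each state change rebuilt $online_node_usage completely,
    # i.e. roughly O(#resources) work per transition
    $self->recompute_online_node_usage();

    # after: only the affected resource's usage is dropped and re-added,
    # i.e. O(1) work per transition
    $self->{online_node_usage}->remove_service_usage($sid);
    $self->add_service_usage($sid, $sd);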
src/PVE/HA/Manager.pm | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index d0d4d0a5..253deba9 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -77,6 +77,21 @@ sub new {
return $self;
}
+sub add_service_usage {
+ my ($self, $sid, $sd) = @_;
+
+ my $online_node_usage = $self->{online_node_usage};
+
+ my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
+ my $nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
+
+ my ($current, $target) = $nodes->@{qw(current target)};
+ $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
+ if $current;
+ $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
+ if $target;
+}
+
sub update_crs_scheduler_mode {
my ($self) = @_;
@@ -314,7 +329,8 @@ my $change_service_state = sub {
$sd->{$k} = $v;
}
- $self->recompute_online_node_usage();
+ $self->{online_node_usage}->remove_service_usage($sid);
+ $self->add_service_usage($sid, $sd);
$sd->{uid} = compute_new_uuid($new_state);
@@ -706,6 +722,8 @@ sub manage {
delete $ss->{$sid};
}
+ $self->recompute_online_node_usage();
+
my $new_rules = $haenv->read_rules_config();
# TODO PVE 10: Remove group migration when HA groups have been fully migrated to rules
@@ -735,8 +753,6 @@ sub manage {
for (;;) {
my $repeat = 0;
- $self->recompute_online_node_usage();
-
foreach my $sid (sort keys %$ss) {
my $sd = $ss->{$sid};
my $cd = $sc->{$sid} || { state => 'disabled' };
--
2.47.3
* [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
$online_node_usage is now built on every call to manage(...), but this
can be reduced to building it only when the scheduler mode changes
(including initialization and the error path, for completeness).
This allows recompute_online_node_usage(...) to be reduced to adding and
removing nodes whenever they come online or go offline, while the
service usage updates are handled whenever a HA resource changes state.
Therefore, recompute_online_node_usage(...) needs to be called only once
in manage(...), after $ns has been properly updated.
Note that with this change the ha-manager no longer acknowledges hotplug
changes to the guest configs as long as the HA resource state doesn't
change.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
If we go for this patch, then we would need some mechanism to update the
static usage either for a single HA resource or for all HA resources
registered in $online_node_usage at once (or just rebuild
$online_node_usage at that point).
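One possible shape for such a mechanism, as a hypothetical sketch only
(the helper name is made up; it builds solely on what this series
already provides):

    # Hypothetical: refresh the static usage for all registered HA
    # resources by re-reading the stats cache and re-adding each one.
    sub refresh_all_service_usage {
        my ($self) = @_;

        # re-read the static service stats cache (ha-manager patch #1)
        $self->{haenv}->update_static_service_stats();

        for my $sid (sort keys $self->{ss}->%*) {
            # dropping the usage also invalidates the cached per-service
            # stats in the Static usage implementation
            $self->{online_node_usage}->remove_service_usage($sid);
            $self->add_service_usage($sid, $self->{ss}->{$sid});
        }
    }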
src/PVE/HA/Manager.pm | 90 +++++++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 41 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 253deba9..6fadb3f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -106,6 +106,7 @@ sub update_crs_scheduler_mode {
if (!defined($old_mode)) {
$haenv->log('info', "using scheduler mode '$new_mode'") if $new_mode ne 'basic';
} elsif ($new_mode eq $old_mode) {
+ $haenv->update_static_service_stats() if $old_mode eq 'static';
return; # nothing to do
} else {
$haenv->log('info', "switching scheduler mode from '$old_mode' to '$new_mode'");
@@ -113,6 +114,39 @@ sub update_crs_scheduler_mode {
$self->{crs}->{scheduler} = $new_mode;
+ my $online_node_usage;
+
+ if ($new_mode eq 'static') {
+ $online_node_usage = eval {
+ my $scheduler = PVE::HA::Usage::Static->new($haenv);
+ $scheduler->add_node($_) for $self->{ns}->list_online_nodes()->@*;
+ $haenv->update_static_service_stats();
+ return $scheduler;
+ };
+ if ($@) {
+ $self->{crs}->{scheduler} = 'basic'; # retry on next update
+ $haenv->log(
+ 'warning',
+ "fallback to 'basic' scheduler mode, init for 'static' failed - $@",
+ );
+ }
+ } elsif ($new_mode eq 'basic') {
+ # handled below in the general fall-back case
+ } else {
+ $haenv->log('warning', "got unknown scheduler mode '$new_mode', using 'basic'");
+ }
+
+ # fallback to the basic algorithm in any case
+ if (!$online_node_usage) {
+ $online_node_usage = PVE::HA::Usage::Basic->new($haenv);
+ $online_node_usage->add_node($_) for $self->{ns}->list_online_nodes()->@*;
+ }
+
+ $self->{online_node_usage} = $online_node_usage;
+
+ # initialize with current nodes and services states
+ $self->add_service_usage($_, $self->{ss}->{$_}) for keys $self->{ss}->%*;
+
return;
}
@@ -253,49 +287,19 @@ my $valid_service_states = {
sub recompute_online_node_usage {
my ($self) = @_;
- my $haenv = $self->{haenv};
+ my ($haenv, $ns) = $self->@{qw(haenv ns)};
- my $online_nodes = { map { $_ => 1 } $self->{ns}->list_online_nodes()->@* };
+ for my $node ($self->{online_node_usage}->list_nodes()) {
+ next if $ns->node_is_online($node);
- my $online_node_usage;
-
- if (my $mode = $self->{crs}->{scheduler}) {
- if ($mode eq 'static') {
- $online_node_usage = eval {
- my $scheduler = PVE::HA::Usage::Static->new($haenv);
- $scheduler->add_node($_) for keys $online_nodes->%*;
- $haenv->update_static_service_stats();
- return $scheduler;
- };
- $haenv->log(
- 'warning',
- "fallback to 'basic' scheduler mode, init for 'static' failed - $@",
- ) if $@;
- } elsif ($mode eq 'basic') {
- # handled below in the general fall-back case
- } else {
- $haenv->log('warning', "got unknown scheduler mode '$mode', using 'basic'");
- }
+ $self->{online_node_usage}->remove_node($node);
}
- # fallback to the basic algorithm in any case
- if (!$online_node_usage) {
- $online_node_usage = PVE::HA::Usage::Basic->new($haenv);
- $online_node_usage->add_node($_) for keys $online_nodes->%*;
+ for my $node ($ns->list_online_nodes()->@*) {
+ next if $self->{online_node_usage}->contains_node($node);
+
+ $self->{online_node_usage}->add_node($node);
}
-
- for my $sid (sort keys $self->{ss}->%*) {
- my $sd = $self->{ss}->{$sid};
- my $used_nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
- my ($current, $target) = $used_nodes->@{qw(current target)};
-
- $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
- if $current;
- $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
- if $target;
- }
-
- $self->{online_node_usage} = $online_node_usage;
}
my $change_service_state = sub {
@@ -693,6 +697,8 @@ sub manage {
$self->{groups} = $haenv->read_group_config(); # update
+ $self->recompute_online_node_usage();
+
# compute new service status
# add new service
@@ -704,11 +710,13 @@ sub manage {
$haenv->log('info', "adding new service '$sid' on node '$cd->{node}'");
# assume we are running to avoid relocate running service at add
my $state = ($cd->{state} eq 'started') ? 'request_start' : 'request_stop';
- $ss->{$sid} = {
+ my $sd = $ss->{$sid} = {
state => $state,
node => $cd->{node},
uid => compute_new_uuid('started'),
};
+
+ $self->add_service_usage($sid, $sd);
}
# remove stale or ignored services from manager state
@@ -718,12 +726,12 @@ sub manage {
my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
$haenv->log('info', "removing stale service '$sid' ($reason)");
+ $self->{online_node_usage}->remove_service_usage($sid);
+
# remove all service related state information
delete $ss->{$sid};
}
- $self->recompute_online_node_usage();
-
my $new_rules = $haenv->read_rules_config();
# TODO PVE 10: Remove group migration when HA groups have been fully migrated to rules
--
2.47.3