* [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
An initial RFC for making the changes to the HA $online_node_usage
object more granular.
The current implementation rebuilds the $online_node_usage object quite
often. The most significant event causing a rebuild is changing a HA
resource's state.
One worst-case scenario I tested was a 3-node cluster using the static
load scheduler with rebalance_on_request_start set and a varying number
of homogeneous HA resources distributed equally across the nodes at
startup. When these are all configured to start and rebalance at the
same time, 2/3 of them move to different nodes in a round-robin fashion:
each HA resource is added to the Manager status' services in state
'request_start' in the next manage(...) call, then 1/3 change to state
'started' while 2/3 change to state 'request_start_balance' (all calling
select_service_node(...) at least once to make that decision).
The biggest bottleneck here is the number of guest configurations that
need to be read and parsed to retrieve the necessary static information
for every rebuild of $online_node_usage, which grows with each HA
resource being handled. The following patches address this.
= Patches =
qemu-server patch #1    fetch default values only when needed
proxmox patch #1        necessary for the pve-rs patch
pve-rs patch #1         allow removing service usage from nodes
ha-manager patch #1     implement static cache and use
                        PVE::Cluster::get_guest_config_properties(...)
ha-manager patch #2-#4  remove redundant $online_node_usage updates
ha-manager patch #5-#6  some decoupling and refactoring

--- next patches need proxmox #1 and proxmox-perl-rs #1 ---

ha-manager patch #7-#8  set up $online_node_usage only once per round and
                        make granular changes in between
ha-manager patch #9     set up $online_node_usage only when the scheduler
                        mode has been changed
                        (will not acknowledge changes to static stats for
                        any HA resource which is already running and
                        hasn't changed its state yet, see TODO for more)
= Benchmarks =
Here are some rudimentary benchmarks with the above setup (3-node
cluster, static load scheduler, rebalance_on_request_start set) in a
virtualized environment. The columns are the HA resource counts and the
rows are the different patches applied (qm #1 = qemu-server patch #1).

Run-times for the first manage(...) call to rebalance the HA resources:
             300      3,000     9,000
master       27.2 s   -         -
#1           33.0 s   -         -
#8           605 ms   5.45 s    11.7 s
#8 + qm #1   248 ms   2.62 s    4.07 s
Average and total run-times for
PVE::HA::Usage::Static::score_nodes_to_start_service(...):
             300               3,000             9,000
master       2.49 ms (0.7 s)   -                 -
#1           454 µs (136 ms)   -                 -
#8           303 µs (91 ms)    367 µs (1.1 s)    429 µs (3.87 s)
#8 + qm #1   102 µs (30 ms)    102 µs (305 ms)   110 µs (991 ms)
I haven't included #9 here because it doesn't acknowledge guest hotplug
changes and improves the amortized runtime rather than the single-run
benchmark above, where #9 has to do roughly the same work as #8.
I ran `sync && echo 3 > /proc/sys/vm/drop_caches` before each of these
runs to write out dirty pages and clear the slab and page caches, but I
haven't looked closely into whether memdb / FUSE are affected by this at
all. Either way, the benchmarks aren't well isolated here and should
still be taken with a big pile of salt.
= TODO =
- patch #9 doesn't acknowledge any hotplug changes to cores / cpulimit /
  memory yet; the ha-manager would need some external notification for
  that (inotify or an 'invalidate_cache' / 'service $sid update' crm
  command?), but maybe that's too much and wouldn't bring any significant
  improvements. Especially when thinking about an upcoming 'dynamic'
  mode, these stats updates would become way more common than with static
  resource changes, so #8 + qm #1 might be enough for now.
- some more performance improvements
  there is still some room and potential to improve the performance here
  and there, e.g. buffering syslog writes, caching the parsed + default
  values for the guests, or making {parse,get_current}_memory faster, but
  I wanted to get some reviews in first; also, these benchmarks are on
  master without the HA rules performance improvements from another
  series
- a more automated benchmark infrastructure
  I have added some (simulated) benchmark scripts to my local tree, which
  I want to add in a future revision once they are more polished, so we
  can keep a better eye on the runtime when changes are introduced; in
  the future it would also be great to have some automated benchmarks /
  test cases that can be run on actual clusters without the hassle of
  setting them up by hand
= Future ideas =
- Acknowledge node CPU + memory hotplug?
- When rebalancing HA resources on startup, it might be better to make
  the decision for multiple HA resources at the same time, so that, for
  homogeneous nodes and services, they aren't rebalanced in a round-robin
  fashion as above, even though that example might be exaggerated
qemu-server:
Daniel Kral (1):
config: only fetch necessary default values in get_derived_property
helper
src/PVE/QemuConfig.pm | 8 +++-----
src/PVE/QemuServer.pm | 6 ++++++
2 files changed, 9 insertions(+), 5 deletions(-)
proxmox:
Daniel Kral (1):
resource-scheduling: change score_nodes_to_start_service signature
proxmox-resource-scheduling/src/pve_static.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
perl-rs:
Daniel Kral (1):
pve-rs: resource_scheduling: allow granular usage changes
.../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
2 files changed, 160 insertions(+), 22 deletions(-)
ha-manager:
Daniel Kral (9):
implement static service stats cache
manager: remove redundant recompute_online_node_usage from
next_state_recovery
manager: remove redundant add_service_usage_to_node from
next_state_recovery
manager: remove redundant add_service_usage_to_node from
next_state_started
rules: resource affinity: decouple get_resource_affinity helper from
Usage class
manager: make recompute_online_node_usage use get_service_nodes helper
usage: allow granular changes to Usage implementations
manager: make online node usage computation granular
manager: make service node usage computation more granular
src/PVE/HA/Env.pm | 12 ++
src/PVE/HA/Env/PVE2.pm | 21 ++++
src/PVE/HA/Manager.pm | 167 ++++++++++++---------------
src/PVE/HA/Resources/PVECT.pm | 3 +-
src/PVE/HA/Resources/PVEVM.pm | 3 +-
src/PVE/HA/Rules/ResourceAffinity.pm | 15 ++-
src/PVE/HA/Sim/Env.pm | 12 ++
src/PVE/HA/Sim/Hardware.pm | 31 +++--
src/PVE/HA/Sim/Resources.pm | 4 +-
src/PVE/HA/Tools.pm | 29 +++++
src/PVE/HA/Usage.pm | 24 +---
src/PVE/HA/Usage/Basic.pm | 33 ++----
src/PVE/HA/Usage/Static.pm | 45 ++++----
src/test/test_failover1.pl | 17 +--
14 files changed, 233 insertions(+), 183 deletions(-)
Summary over all repositories:
19 files changed, 403 insertions(+), 211 deletions(-)
--
Generated by git-murpp 0.8.0
* [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
get_derived_property(...) is called in the semi-hot path of the HA
Manager's static load scheduler to retrieve the static stats of each VM.
As the defaults are only needed in certain cases and for a very small
subset of properties in the VM config, get those separately when needed.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
get_current_memory(...) is still quite costly here, because it calls
parse_memory(...), which calls
PVE::JSONSchema::parse_property_string(...); this adds up when many
guest configurations are parsed in every manage(...) call, but this
patch already helps quite a lot.
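
As a hedged illustration of the call sites this optimizes (the HA
resource plugins call it the same way; the example config values below
are made up):

    use strict;
    use warnings;
    use PVE::QemuConfig;

    # Only these two derived properties are needed per VM by the HA
    # static scheduler, so loading the full defaults hash on every call
    # is avoidable.
    my $conf = { cores => 2, memory => '4096' };

    my $maxcpu = PVE::QemuConfig->get_derived_property($conf, 'max-cpu');    # 1 socket (default) * 2 cores
    my $maxmem = PVE::QemuConfig->get_derived_property($conf, 'max-memory'); # 4096 MiB in bytes

    print "maxcpu=$maxcpu maxmem=$maxmem\n";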
src/PVE/QemuConfig.pm | 8 +++-----
src/PVE/QemuServer.pm | 6 ++++++
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
index d0844c4c..078c87e0 100644
--- a/src/PVE/QemuConfig.pm
+++ b/src/PVE/QemuConfig.pm
@@ -582,12 +582,10 @@ sub load_current_config {
sub get_derived_property {
my ($class, $conf, $name) = @_;
- my $defaults = PVE::QemuServer::load_defaults();
-
if ($name eq 'max-cpu') {
- my $cpus =
- ($conf->{sockets} || $defaults->{sockets}) * ($conf->{cores} || $defaults->{cores});
- return $conf->{vcpus} || $cpus;
+ my $sockets = $conf->{sockets} || PVE::QemuServer::get_default_property_value('sockets');
+ my $cores = $conf->{cores} || PVE::QemuServer::get_default_property_value('cores');
+ return $conf->{vcpus} || ($sockets * $cores);
} elsif ($name eq 'max-memory') { # current usage maximum, not maximum hotpluggable
return get_current_memory($conf->{memory}) * 1024 * 1024;
} else {
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 7d5ab718..459ba8f0 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2314,6 +2314,12 @@ sub write_vm_config {
return $raw;
}
+sub get_default_property_value {
+ my ($name) = @_;
+
+ return $confdesc->{$name}->{default};
+}
+
sub load_defaults {
my $res = {};
--
2.47.3
* [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
This is needed as a StaticNodeUsage is now created in each invocation of
PVE::RS::ResourceScheduling::Static::score_nodes_to_start_service.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
Since it has only one caller, I haven't bothered to make this take
either a borrow or a reference...
proxmox-resource-scheduling/src/pve_static.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/proxmox-resource-scheduling/src/pve_static.rs b/proxmox-resource-scheduling/src/pve_static.rs
index d39614cd..fc40cb5c 100644
--- a/proxmox-resource-scheduling/src/pve_static.rs
+++ b/proxmox-resource-scheduling/src/pve_static.rs
@@ -70,7 +70,7 @@ criteria_struct! {
/// Returns a vector of (nodename, score) pairs. Scores are between 0.0 and 1.0 and a higher score
/// is better.
pub fn score_nodes_to_start_service(
- nodes: &[&StaticNodeUsage],
+ nodes: &[StaticNodeUsage],
service: &StaticServiceUsage,
) -> Result<Vec<(String, f64)>, Error> {
let len = nodes.len();
--
2.47.3
* [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
Implements a simple bidirectional map to track which service usages have
been added to nodes, so that these can be removed later individually.
The StaticServiceUsage is added to the HashMap<> in StaticNodeInfo to
reduce unnecessary indirection when summing these values in
score_nodes_to_start_service(...).
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I started out adding and removing the service_usage in StaticNodeUsage
on each add_service_usage_to_node(...) and remove_service_usage(...)
call individually, but I think summing those up every time is better for
numerical stability..
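
For reference, a hedged usage sketch of the extended Perl-facing API
(mirroring the test changes below; node sizes and SIDs are made up):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use PVE::RS::ResourceScheduling::Static;

    my $static = PVE::RS::ResourceScheduling::Static->new();
    $static->add_node("node1", 16, 64_000_000_000);
    $static->add_node("node2", 16, 64_000_000_000);

    my $service = { maxcpu => 4, maxmem => 8_000_000_000 };

    # usage is now keyed by SID on each node ...
    $static->add_service_usage_to_node("node1", "vm:100", $service);

    # ... so it can be removed again individually instead of rebuilding
    # the whole scheduler instance
    $static->remove_service_usage("vm:100");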
.../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
2 files changed, 160 insertions(+), 22 deletions(-)
diff --git a/pve-rs/src/bindings/resource_scheduling_static.rs b/pve-rs/src/bindings/resource_scheduling_static.rs
index 7cf2f35..535dfe3 100644
--- a/pve-rs/src/bindings/resource_scheduling_static.rs
+++ b/pve-rs/src/bindings/resource_scheduling_static.rs
@@ -16,8 +16,16 @@ pub mod pve_rs_resource_scheduling_static {
perlmod::declare_magic!(Box<Scheduler> : &Scheduler as "PVE::RS::ResourceScheduling::Static");
+ struct StaticNodeInfo {
+ name: String,
+ maxcpu: usize,
+ maxmem: usize,
+ services: HashMap<String, StaticServiceUsage>,
+ }
+
struct Usage {
- nodes: HashMap<String, StaticNodeUsage>,
+ nodes: HashMap<String, StaticNodeInfo>,
+ service_nodes: HashMap<String, Vec<String>>,
}
/// A scheduler instance contains the resource usage by node.
@@ -30,6 +38,7 @@ pub mod pve_rs_resource_scheduling_static {
pub fn new(#[raw] class: Value) -> Result<Value, Error> {
let inner = Usage {
nodes: HashMap::new(),
+ service_nodes: HashMap::new(),
};
Ok(perlmod::instantiate_magic!(
@@ -39,7 +48,7 @@ pub mod pve_rs_resource_scheduling_static {
/// Method: Add a node with its basic CPU and memory info.
///
- /// This inserts a [`StaticNodeUsage`] entry for the node into the scheduler instance.
+ /// This inserts a [`StaticNodeInfo`] entry for the node into the scheduler instance.
#[export]
pub fn add_node(
#[try_from_ref] this: &Scheduler,
@@ -53,12 +62,11 @@ pub mod pve_rs_resource_scheduling_static {
bail!("node {} already added", nodename);
}
- let node = StaticNodeUsage {
+ let node = StaticNodeInfo {
name: nodename.clone(),
- cpu: 0.0,
maxcpu,
- mem: 0,
maxmem,
+ services: HashMap::new(),
};
usage.nodes.insert(nodename, node);
@@ -67,10 +75,25 @@ pub mod pve_rs_resource_scheduling_static {
/// Method: Remove a node from the scheduler.
#[export]
- pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) {
+ pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) -> Result<(), Error> {
let mut usage = this.inner.lock().unwrap();
- usage.nodes.remove(nodename);
+ if let Some(node) = usage.nodes.remove(nodename) {
+ for (sid, _) in node.services.iter() {
+ match usage.service_nodes.get_mut(sid) {
+ Some(nodes) => {
+ nodes.retain(|node| !node.eq(nodename));
+ }
+ None => bail!(
+ "service '{}' not present while removing node '{}'",
+ sid,
+ nodename
+ ),
+ }
+ }
+ }
+
+ Ok(())
}
/// Method: Get a list of all the nodes in the scheduler.
@@ -93,22 +116,55 @@ pub mod pve_rs_resource_scheduling_static {
usage.nodes.contains_key(nodename)
}
- /// Method: Add usage of `service` to the node's usage.
+ /// Method: Add service `sid` and its `service_usage` to the node.
#[export]
pub fn add_service_usage_to_node(
#[try_from_ref] this: &Scheduler,
nodename: &str,
- service: StaticServiceUsage,
+ sid: &str,
+ service_usage: StaticServiceUsage,
) -> Result<(), Error> {
let mut usage = this.inner.lock().unwrap();
match usage.nodes.get_mut(nodename) {
Some(node) => {
- node.add_service_usage(&service);
- Ok(())
+ if node.services.contains_key(sid) {
+ bail!("service '{}' already added to node '{}'", sid, nodename);
+ }
+
+ node.services.insert(sid.to_string(), service_usage);
}
None => bail!("node '{}' not present in usage hashmap", nodename),
}
+
+ if let Some(nodes) = usage.service_nodes.get_mut(sid) {
+ nodes.push(nodename.to_string());
+ } else {
+ usage
+ .service_nodes
+ .insert(sid.to_string(), vec![nodename.to_string()]);
+ }
+
+ Ok(())
+ }
+
+ /// Method: Remove service `sid` and its usage from all assigned nodes.
+ #[export]
+ fn remove_service_usage(#[try_from_ref] this: &Scheduler, sid: &str) -> Result<(), Error> {
+ let mut usage = this.inner.lock().unwrap();
+
+ if let Some(nodes) = usage.service_nodes.remove(sid) {
+ for nodename in &nodes {
+ match usage.nodes.get_mut(nodename) {
+ Some(node) => {
+ node.services.remove(sid);
+ }
+ None => bail!("service '{}' not present on node '{}'", sid, nodename),
+ }
+ }
+ }
+
+ Ok(())
}
/// Scores all previously added nodes for starting a `service` on.
@@ -126,7 +182,25 @@ pub mod pve_rs_resource_scheduling_static {
service: StaticServiceUsage,
) -> Result<Vec<(String, f64)>, Error> {
let usage = this.inner.lock().unwrap();
- let nodes = usage.nodes.values().collect::<Vec<&StaticNodeUsage>>();
+ let nodes = usage
+ .nodes
+ .values()
+ .map(|node| {
+ let mut node_usage = StaticNodeUsage {
+ name: node.name.to_string(),
+ cpu: 0.0,
+ maxcpu: node.maxcpu,
+ mem: 0,
+ maxmem: node.maxmem,
+ };
+
+ for service in node.services.values() {
+ node_usage.add_service_usage(service);
+ }
+
+ node_usage
+ })
+ .collect::<Vec<StaticNodeUsage>>();
proxmox_resource_scheduling::pve_static::score_nodes_to_start_service(&nodes, &service)
}
diff --git a/pve-rs/test/resource_scheduling.pl b/pve-rs/test/resource_scheduling.pl
index e3b7d2e..318238e 100755
--- a/pve-rs/test/resource_scheduling.pl
+++ b/pve-rs/test/resource_scheduling.pl
@@ -7,6 +7,20 @@ use Test::More;
use PVE::RS::ResourceScheduling::Static;
+my sub score_nodes {
+ my ($static, $service) = @_;
+
+ my $score_list = $static->score_nodes_to_start_service($service);
+
+ # imitate HA manager
+ my $scores = { map { $_->[0] => -$_->[1] } $score_list->@* };
+ my @nodes = sort {
+ $scores->{$a} <=> $scores->{$b} || $a cmp $b
+ } keys $scores->%*;
+
+ return @nodes;
+}
+
sub test_basic {
my $static = PVE::RS::ResourceScheduling::Static->new();
is(scalar($static->list_nodes()->@*), 0, 'node list empty');
@@ -33,6 +47,7 @@ sub test_balance {
maxmem => 20_000_000_000,
};
+ my @sid_names = ("a" .. "z");
for (my $i = 0; $i < 15; $i++) {
my $score_list = $static->score_nodes_to_start_service($service);
@@ -50,7 +65,54 @@ sub test_balance {
is($nodes[1], "A", 'second should be A');
}
- $static->add_service_usage_to_node($nodes[0], $service);
+ $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
+ }
+}
+
+sub test_balance_removal {
+ my $static = PVE::RS::ResourceScheduling::Static->new();
+ $static->add_node("A", 10, 100_000_000_000);
+ $static->add_node("B", 20, 200_000_000_000);
+ $static->add_node("C", 30, 300_000_000_000);
+
+ my $service = {
+ maxcpu => 4,
+ maxmem => 20_000_000_000,
+ };
+
+ $static->add_service_usage_to_node("A", "a", $service);
+ $static->add_service_usage_to_node("A", "b", $service);
+ $static->add_service_usage_to_node("B", "c", $service);
+ $static->add_service_usage_to_node("B", "d", $service);
+ $static->add_service_usage_to_node("C", "c", $service);
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "C");
+ is($nodes[1], "B");
+ is($nodes[2], "A");
+ }
+
+ $static->remove_service_usage("d");
+ $static->remove_service_usage("c");
+ $static->add_service_usage_to_node("C", "c", $service);
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "B");
+ is($nodes[1], "C");
+ is($nodes[2], "A");
+ }
+
+ $static->remove_node("B");
+
+ {
+ my @nodes = score_nodes($static, $service);
+
+ is($nodes[0], "C");
+ is($nodes[1], "A");
}
}
@@ -66,11 +128,11 @@ sub test_overcommitted {
maxmem => 536_870_912,
};
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("A", $service);
- $static->add_service_usage_to_node("B", $service);
- $static->add_service_usage_to_node("A", $service);
+ $static->add_service_usage_to_node("A", "a", $service);
+ $static->add_service_usage_to_node("A", "b", $service);
+ $static->add_service_usage_to_node("A", "c", $service);
+ $static->add_service_usage_to_node("B", "d", $service);
+ $static->add_service_usage_to_node("A", "e", $service);
my $score_list = $static->score_nodes_to_start_service($service);
@@ -96,9 +158,9 @@ sub test_balance_small_memory_difference {
$static->add_node("C", 4, 8_000_000_000);
if ($with_start_load) {
- $static->add_service_usage_to_node("A", { maxcpu => 4, maxmem => 1_000_000_000 });
- $static->add_service_usage_to_node("B", { maxcpu => 2, maxmem => 1_000_000_000 });
- $static->add_service_usage_to_node("C", { maxcpu => 2, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("A", "a", { maxcpu => 4, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("B", "b", { maxcpu => 2, maxmem => 1_000_000_000 });
+ $static->add_service_usage_to_node("C", "c", { maxcpu => 2, maxmem => 1_000_000_000 });
}
my $service = {
@@ -106,6 +168,7 @@ sub test_balance_small_memory_difference {
maxmem => 16_000_000,
};
+ my @sid_names = ("d" .. "z");
for (my $i = 0; $i < 20; $i++) {
my $score_list = $static->score_nodes_to_start_service($service);
@@ -131,12 +194,13 @@ sub test_balance_small_memory_difference {
die "internal error, got $i % 4 == " . ($i % 4) . "\n";
}
- $static->add_service_usage_to_node($nodes[0], $service);
+ $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
}
}
test_basic();
test_balance();
+test_balance_removal();
test_overcommitted();
test_balance_small_memory_difference(1);
test_balance_small_memory_difference(0);
--
2.47.3
* [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
As the HA Manager builds the static load scheduler, it queries the
services' static usage by reading and parsing the guest configs
individually, which can take significant time depending on how many
times recompute_online_node_usage(...) is called.

PVE::Cluster exposes an efficient interface to gather a set of
properties from one or all guest configs [0]. This is used here to build
a rather short-lived cache on every (re)initialization of the static
load scheduler.

The downside of this approach shows if there are far more non-HA-managed
guests in the cluster than HA-managed guests, as much more information
is loaded than necessary. It also doesn't cache the default values for
the environment-specific static service usage, which is quite a
bottleneck as well.
[0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
Suggested-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
If the above-mentioned downside is too large for some setups, we could
also just add a switch to enable/disable the static stats cache, or
automatically figure out whether it brings any benefits... But I think
this will be good enough, especially with the latter patches making way
fewer calls to get_service_usage(...).
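
For reference, a minimal sketch of what the new cache fetches in a
single call (property list as in update_static_service_stats(...)
below); the returned hash is keyed by VMID:

    use strict;
    use warnings;
    use PVE::Cluster;

    my $properties = ['cores', 'cpulimit', 'memory', 'sockets', 'vcpus'];
    my $stats = PVE::Cluster::get_guest_config_properties($properties);

    for my $vmid (sort { $a <=> $b } keys %$stats) {
        my $props = $stats->{$vmid};
        printf "%s: cores=%s memory=%s\n", $vmid,
            $props->{cores} // '-', $props->{memory} // '-';
    }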
src/PVE/HA/Env.pm | 12 ++++++++++++
src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
src/PVE/HA/Manager.pm | 1 +
src/PVE/HA/Resources/PVECT.pm | 3 ++-
src/PVE/HA/Resources/PVEVM.pm | 3 ++-
src/PVE/HA/Sim/Env.pm | 12 ++++++++++++
src/PVE/HA/Sim/Hardware.pm | 31 +++++++++++++++++++++----------
src/PVE/HA/Sim/Resources.pm | 4 ++--
8 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
index e00272a0..4282d33f 100644
--- a/src/PVE/HA/Env.pm
+++ b/src/PVE/HA/Env.pm
@@ -300,6 +300,18 @@ sub get_datacenter_settings {
return $self->{plug}->get_datacenter_settings();
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ return $self->{plug}->get_static_service_stats($id);
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ return $self->{plug}->update_static_service_stats();
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index e76e86b8..e1d752f7 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -49,6 +49,8 @@ sub new {
$self->{nodename} = $nodename;
+ $self->{static_service_stats} = undef;
+
return $self;
}
@@ -497,6 +499,25 @@ sub get_datacenter_settings {
};
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ # undef if update_static_service_stats(...) failed before
+ return undef if !defined($self->{static_service_stats});
+
+ return $self->{static_service_stats}->{$id} // {};
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ my $properties = ['cores', 'cpulimit', 'memory', 'sockets', 'vcpus'];
+ my $stats = eval { PVE::Cluster::get_guest_config_properties($properties) };
+ $self->log('warning', "unable to update static service stats cache - $@") if $@;
+
+ $self->{static_service_stats} = $stats;
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index ba59f642..3f81f233 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -247,6 +247,7 @@ sub recompute_online_node_usage {
$online_node_usage = eval {
my $scheduler = PVE::HA::Usage::Static->new($haenv);
$scheduler->add_node($_) for $online_nodes->@*;
+ $haenv->update_static_service_stats();
return $scheduler;
};
$haenv->log(
diff --git a/src/PVE/HA/Resources/PVECT.pm b/src/PVE/HA/Resources/PVECT.pm
index 44644d92..091249d7 100644
--- a/src/PVE/HA/Resources/PVECT.pm
+++ b/src/PVE/HA/Resources/PVECT.pm
@@ -156,7 +156,8 @@ sub remove_locks {
sub get_static_stats {
my ($class, $haenv, $id, $service_node) = @_;
- my $conf = PVE::LXC::Config->load_config($id, $service_node);
+ my $conf = $haenv->get_static_service_stats($id);
+ $conf = PVE::LXC::Config->load_config($id, $service_node) if !defined($conf);
return {
maxcpu => PVE::LXC::Config->get_derived_property($conf, 'max-cpu'),
diff --git a/src/PVE/HA/Resources/PVEVM.pm b/src/PVE/HA/Resources/PVEVM.pm
index e634fe3c..d1bc3329 100644
--- a/src/PVE/HA/Resources/PVEVM.pm
+++ b/src/PVE/HA/Resources/PVEVM.pm
@@ -177,7 +177,8 @@ sub remove_locks {
sub get_static_stats {
my ($class, $haenv, $id, $service_node) = @_;
- my $conf = PVE::QemuConfig->load_config($id, $service_node);
+ my $conf = $haenv->get_static_service_stats($id);
+ $conf = PVE::QemuConfig->load_config($id, $service_node) if !defined($conf);
return {
maxcpu => PVE::QemuConfig->get_derived_property($conf, 'max-cpu'),
diff --git a/src/PVE/HA/Sim/Env.pm b/src/PVE/HA/Sim/Env.pm
index 684e92f8..1d70026e 100644
--- a/src/PVE/HA/Sim/Env.pm
+++ b/src/PVE/HA/Sim/Env.pm
@@ -488,6 +488,18 @@ sub get_datacenter_settings {
};
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ return $self->{hardware}->get_static_service_stats($id);
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ return $self->{hardware}->update_static_service_stats();
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
index 9e8c7995..fae27b2a 100644
--- a/src/PVE/HA/Sim/Hardware.pm
+++ b/src/PVE/HA/Sim/Hardware.pm
@@ -387,16 +387,6 @@ sub write_service_status {
return $res;
}
-sub read_static_service_stats {
- my ($self) = @_;
-
- my $filename = "$self->{statusdir}/static_service_stats";
- my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
- $self->log('error', "loading static service stats failed - $@") if $@;
-
- return $stats;
-}
-
sub new {
my ($this, $testdir) = @_;
@@ -477,6 +467,8 @@ sub new {
$self->{service_config} = $self->read_service_config();
+ $self->{static_service_stats} = undef;
+
return $self;
}
@@ -943,6 +935,25 @@ sub watchdog_update {
return &$modify_watchog($self, $code);
}
+sub get_static_service_stats {
+ my ($self, $id) = @_;
+
+ # undef if update_static_service_stats(...) failed before
+ return undef if !defined($self->{static_service_stats});
+
+ return $self->{static_service_stats}->{$id} // {};
+}
+
+sub update_static_service_stats {
+ my ($self) = @_;
+
+ my $filename = "$self->{statusdir}/static_service_stats";
+ my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
+ $self->log('warning', "unable to update static service stats cache - $@") if $@;
+
+ $self->{static_service_stats} = $stats;
+}
+
sub get_static_node_stats {
my ($self) = @_;
diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
index 72623ee1..7641b1a9 100644
--- a/src/PVE/HA/Sim/Resources.pm
+++ b/src/PVE/HA/Sim/Resources.pm
@@ -143,8 +143,8 @@ sub get_static_stats {
my $sid = $class->type() . ":$id";
my $hardware = $haenv->hardware();
- my $stats = $hardware->read_static_service_stats();
- if (my $service_stats = $stats->{$sid}) {
+ my $service_stats = $hardware->get_static_service_stats($sid);
+ if (%$service_stats) {
return $service_stats;
} elsif ($id =~ /^(\d)(\d\d)/) {
# auto assign usage calculated from ID for convenience
--
2.47.3
* [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
recompute_online_node_usage(...) currently only depends on the
configured CRS scheduler mode, the list of online nodes, and each HA
resource's state, current node and optional migration target.

The proper recovery of HA resources was introduced in 9da84a0d ("fix
'change_service_location' misuse and recovery from fencing") as a
private helper to recover fenced HA resources, and
recompute_online_node_usage(...) was needed there.

As the recovery of fenced HA resources has been its own FSM state since
c259b1a8 ("manager: make recovery actual state in FSM"), the
change_service_state(...) to 'recovery' already calls
recompute_online_node_usage(...) beforehand, making this call redundant.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 3f81f233..f69a7fe9 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1301,8 +1301,6 @@ sub next_state_recovery {
my $fenced_node = $sd->{node}; # for logging purpose
- $self->recompute_online_node_usage(); # we want the most current node state
-
my $recovery_node = select_service_node(
$self->{rules},
$self->{online_node_usage},
--
2.47.3
* [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node from next_state_recovery
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
Since 5c2eef4b ("account service to source and target during move"), a
moving HA resource's load is accounted for on both the source and target
nodes.

A HA resource is in the 'recovery' state after its node has been
successfully fenced, and its primary goal becomes finding a recovery
node. As soon as a recovery node is found, the HA resource is moved
there. As the previous node was fenced, there is only load on the
recovery node.

The add_service_usage_to_node(...) call is redundant at this point, as
the subsequent call to change_service_state(...) to either the 'started'
or 'request_stop' state will call recompute_online_node_usage(...),
causing the changes to $online_node_usage to be discarded immediately.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index f69a7fe9..6afc9250 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1321,8 +1321,6 @@ sub next_state_recovery {
$fence_recovery_cleanup->($self, $sid, $fenced_node);
$haenv->steal_service($sid, $sd->{node}, $recovery_node);
- $self->{online_node_usage}->add_service_usage_to_node($recovery_node, $sid, $recovery_node);
- $self->{online_node_usage}->add_service_node($sid, $recovery_node);
# NOTE: $sd *is normally read-only*, fencing is the exception
$cd->{node} = $sd->{node} = $recovery_node;
--
2.47.3
* [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
Since 5c2eef4b ("account service to source and target during move"), a
moving HA resource's load is accounted for on both the source and target
nodes.

A HA resource in the 'started' state, which is not configured otherwise
and has no pending CRM commands left to process, actively checks whether
there is a "better" node placement by querying select_service_node(...).
When a better node is found, the HA resource is migrated or relocated to
that node, depending on its type.

The add_service_usage_to_node(...) call is redundant at this point, as
the subsequent call to change_service_state(...) to either the 'migrate'
or 'relocate' state will call recompute_online_node_usage(...), causing
the changes to $online_node_usage to be discarded immediately.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 3 ---
1 file changed, 3 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 6afc9250..468e41eb 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1194,9 +1194,6 @@ sub next_state_started {
);
if ($node && ($sd->{node} ne $node)) {
- $self->{online_node_usage}->add_service_usage_to_node($node, $sid, $sd->{node});
- $self->{online_node_usage}->add_service_node($sid, $node);
-
if (defined(my $fallback = $sd->{maintenance_node})) {
if ($node eq $fallback) {
$haenv->log(
--
2.47.3
* [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
The resource affinity rules need information about the nodes used by
other HA resources to be enacted correctly, which was previously proxied
through $online_node_usage.

The new get_used_service_nodes(...) helper mirrors the logic in
recompute_online_node_usage(...) to retrieve the nodes on which a HA
resource $sid currently puts load.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
I wanted to put this information in PVE::HA::Usage directly, but figured
it would make Usage much more dependent on $ss / $sd.. We can still do
that later on or move the helper elsewhere, e.g. making ServiceStatus
its own class?
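
For reference, a hedged example of the helper's result shape (state and
node names are illustrative; the exact rules are in the PVE::HA::Tools
hunk below):

    use strict;
    use warnings;
    use PVE::HA::Tools;

    my $online_nodes = { node1 => 1, node2 => 1 };
    my $sd = { state => 'migrate', node => 'node1', target => 'node2' };

    my $used = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
    # -> { current => 'node1', target => 'node2' }, as a migrating
    #    resource puts load on both its source and its target node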
src/PVE/HA/Manager.pm | 24 +++++++++++------------
src/PVE/HA/Rules/ResourceAffinity.pm | 15 ++++++++------
src/PVE/HA/Tools.pm | 29 ++++++++++++++++++++++++++++
src/PVE/HA/Usage.pm | 18 -----------------
src/PVE/HA/Usage/Basic.pm | 19 ------------------
src/PVE/HA/Usage/Static.pm | 19 ------------------
src/test/test_failover1.pl | 17 +++++++++-------
7 files changed, 59 insertions(+), 82 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 468e41eb..0226e427 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -121,12 +121,12 @@ sub flush_master_status {
=head3 select_service_node(...)
-=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference)
+=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)
Used to select the best fitting node for the service C<$sid>, with the
-configuration C<$service_conf> and state C<$sd>, according to the rules defined
-in C<$rules>, available node utilization in C<$online_node_usage>, and the
-given C<$node_preference>.
+configuration C<$service_conf>, according to the rules defined in C<$rules>,
+available node utilization in C<$online_node_usage>, the service states in
+C<$ss> and the given C<$node_preference>.
The C<$node_preference> can be set to:
@@ -143,11 +143,12 @@ The C<$node_preference> can be set to:
=cut
sub select_service_node {
- my ($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference) = @_;
+ my ($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference) = @_;
die "'$node_preference' is not a valid node_preference for select_service_node\n"
if $node_preference !~ m/(none|best-score|try-next)/;
+ my $sd = $ss->{$sid};
my ($current_node, $tried_nodes, $maintenance_fallback) =
$sd->@{qw(node failed_nodes maintenance_node)};
@@ -155,7 +156,8 @@ sub select_service_node {
return undef if !%$pri_nodes;
- my ($together, $separate) = get_resource_affinity($rules, $sid, $online_node_usage);
+ my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
+ my ($together, $separate) = get_resource_affinity($rules, $sid, $ss, $online_nodes);
# stay on current node if possible (avoids random migrations)
if (
@@ -281,7 +283,6 @@ sub recompute_online_node_usage {
|| $state eq 'recovery'
) {
$online_node_usage->add_service_usage_to_node($sd->{node}, $sid, $sd->{node});
- $online_node_usage->set_service_node($sid, $sd->{node});
} elsif (
$state eq 'migrate'
|| $state eq 'relocate'
@@ -291,11 +292,9 @@ sub recompute_online_node_usage {
# count it for both, source and target as load is put on both
if ($state ne 'request_start_balance') {
$online_node_usage->add_service_usage_to_node($source, $sid, $source, $target);
- $online_node_usage->add_service_node($sid, $source);
}
if ($online_node_usage->contains_node($target)) {
$online_node_usage->add_service_usage_to_node($target, $sid, $source, $target);
- $online_node_usage->add_service_node($sid, $target);
}
} elsif ($state eq 'stopped' || $state eq 'request_start') {
# do nothing
@@ -308,7 +307,6 @@ sub recompute_online_node_usage {
# case a node dies, as we cannot really know if the to-be-aborted incoming migration
# has already cleaned up all used resources
$online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $target);
- $online_node_usage->set_service_node($sid, $target);
}
}
}
@@ -1022,7 +1020,7 @@ sub next_state_request_start {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
'best-score',
);
my $select_text = $selected_node ne $current_node ? 'new' : 'current';
@@ -1189,7 +1187,7 @@ sub next_state_started {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
$select_node_preference,
);
@@ -1303,7 +1301,7 @@ sub next_state_recovery {
$self->{online_node_usage},
$sid,
$cd,
- $sd,
+ $self->{ss},
'best-score',
);
diff --git a/src/PVE/HA/Rules/ResourceAffinity.pm b/src/PVE/HA/Rules/ResourceAffinity.pm
index 9bc039ba..968aa088 100644
--- a/src/PVE/HA/Rules/ResourceAffinity.pm
+++ b/src/PVE/HA/Rules/ResourceAffinity.pm
@@ -5,6 +5,7 @@ use warnings;
use PVE::HA::HashTools qw(set_intersect sets_are_disjoint);
use PVE::HA::Rules;
+use PVE::HA::Tools;
use base qw(Exporter);
use base qw(PVE::HA::Rules);
@@ -496,12 +497,12 @@ sub get_affinitive_resources : prototype($$) {
return ($together, $separate);
}
-=head3 get_resource_affinity($rules, $sid, $online_node_usage)
+=head3 get_resource_affinity($rules, $sid, $ss, $online_nodes)
Returns a list of two hashes, where the first describes the positive resource
affinity and the second hash describes the negative resource affinity for
-resource C<$sid> according to the resource affinity rules in C<$rules> and the
-resource locations in C<$online_node_usage>.
+resource C<$sid> according to the resource affinity rules in C<$rules>, the
+service status C<$ss> and the C<$online_nodes> hash.
For the positive resource affinity of a resource C<$sid>, each element in the
hash represents an online node, where other resources, which C<$sid> is in
@@ -529,8 +530,8 @@ resource C<$sid> is in a negative affinity with, the returned value will be:
=cut
-sub get_resource_affinity : prototype($$$) {
- my ($rules, $sid, $online_node_usage) = @_;
+sub get_resource_affinity : prototype($$$$) {
+ my ($rules, $sid, $ss, $online_nodes) = @_;
my $together = {};
my $separate = {};
@@ -543,7 +544,9 @@ sub get_resource_affinity : prototype($$$) {
for my $csid (keys %{ $rule->{resources} }) {
next if $csid eq $sid;
- my $nodes = $online_node_usage->get_service_nodes($csid);
+ my $used_nodes =
+ PVE::HA::Tools::get_used_service_nodes($ss->{$csid}, $online_nodes);
+ my $nodes = [values %$used_nodes];
next if !$nodes || !@$nodes; # skip unassigned nodes
diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
index 71eb5d0b..7f718e25 100644
--- a/src/PVE/HA/Tools.pm
+++ b/src/PVE/HA/Tools.pm
@@ -188,6 +188,35 @@ sub count_fenced_services {
return $count;
}
+sub get_used_service_nodes {
+ my ($sd, $online_nodes) = @_;
+
+ my $result = {};
+
+ my ($state, $node, $target) = $sd->@{qw(state node target)};
+
+ return $result if $state eq 'stopped' || $state eq 'request_start';
+
+ if (
+ $state eq 'started'
+ || $state eq 'request_stop'
+ || $state eq 'fence'
+ || $state eq 'freeze'
+ || $state eq 'error'
+ || $state eq 'recovery'
+ || $state eq 'migrate'
+ || $state eq 'relocate'
+ ) {
+ $result->{current} = $node if $online_nodes->{$node};
+ }
+
+ if ($state eq 'migrate' || $state eq 'relocate' || $state eq 'request_start_balance') {
+ $result->{target} = $target if defined($target) && $online_nodes->{$target};
+ }
+
+ return $result;
+}
+
sub get_verbose_service_state {
my ($service_state, $service_conf) = @_;
diff --git a/src/PVE/HA/Usage.pm b/src/PVE/HA/Usage.pm
index 7f4d9ca3..66d9572e 100644
--- a/src/PVE/HA/Usage.pm
+++ b/src/PVE/HA/Usage.pm
@@ -27,24 +27,6 @@ sub list_nodes {
die "implement in subclass";
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- die "implement in subclass";
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- die "implement in subclass";
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- die "implement in subclass";
-}
-
sub contains_node {
my ($self, $nodename) = @_;
diff --git a/src/PVE/HA/Usage/Basic.pm b/src/PVE/HA/Usage/Basic.pm
index afe3733c..ead08c54 100644
--- a/src/PVE/HA/Usage/Basic.pm
+++ b/src/PVE/HA/Usage/Basic.pm
@@ -11,7 +11,6 @@ sub new {
return bless {
nodes => {},
haenv => $haenv,
- 'service-nodes' => {},
}, $class;
}
@@ -39,24 +38,6 @@ sub contains_node {
return defined($self->{nodes}->{$nodename});
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- return $self->{'service-nodes'}->{$sid};
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- $self->{'service-nodes'}->{$sid} = [$nodename];
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- push @{ $self->{'service-nodes'}->{$sid} }, $nodename;
-}
-
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
diff --git a/src/PVE/HA/Usage/Static.pm b/src/PVE/HA/Usage/Static.pm
index 6707a54c..061e74a2 100644
--- a/src/PVE/HA/Usage/Static.pm
+++ b/src/PVE/HA/Usage/Static.pm
@@ -22,7 +22,6 @@ sub new {
'service-stats' => {},
haenv => $haenv,
scheduler => $scheduler,
- 'service-nodes' => {},
'service-counts' => {}, # Service count on each node. Fallback if scoring calculation fails.
}, $class;
}
@@ -87,24 +86,6 @@ my sub get_service_usage {
return $service_stats;
}
-sub get_service_nodes {
- my ($self, $sid) = @_;
-
- return $self->{'service-nodes'}->{$sid};
-}
-
-sub set_service_node {
- my ($self, $sid, $nodename) = @_;
-
- $self->{'service-nodes'}->{$sid} = [$nodename];
-}
-
-sub add_service_node {
- my ($self, $sid, $nodename) = @_;
-
- push @{ $self->{'service-nodes'}->{$sid} }, $nodename;
-}
-
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
diff --git a/src/test/test_failover1.pl b/src/test/test_failover1.pl
index 78a001eb..495d4b4b 100755
--- a/src/test/test_failover1.pl
+++ b/src/test/test_failover1.pl
@@ -14,9 +14,10 @@ PVE::HA::Rules::NodeAffinity->register();
PVE::HA::Rules->init(property_isolation => 1);
+my $sid = 'vm:111';
my $rules = PVE::HA::Rules->parse_config("rules.tmp", <<EOD);
node-affinity: prefer_node1
- resources vm:111
+ resources $sid
nodes node1
EOD
@@ -31,10 +32,12 @@ my $service_conf = {
failback => 1,
};
-my $sd = {
- node => $service_conf->{node},
- failed_nodes => undef,
- maintenance_node => undef,
+my $ss = {
+ "$sid" => {
+ node => $service_conf->{node},
+ failed_nodes => undef,
+ maintenance_node => undef,
+ },
};
sub test {
@@ -43,14 +46,14 @@ sub test {
my $select_node_preference = $try_next ? 'try-next' : 'none';
my $node = PVE::HA::Manager::select_service_node(
- $rules, $online_node_usage, "vm:111", $service_conf, $sd, $select_node_preference,
+ $rules, $online_node_usage, "$sid", $service_conf, $ss, $select_node_preference,
);
my (undef, undef, $line) = caller();
die "unexpected result: $node != ${expected_node} at line $line\n"
if $node ne $expected_node;
- $sd->{node} = $node;
+ $ss->{$sid}->{node} = $node;
}
test('node1');
--
2.47.3
* [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
As the previously introduced PVE::HA::Tools::get_used_service_nodes(...)
helper reflects the same logic as in recompute_online_node_usage(...),
use the helper instead to reduce code duplication.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 53 +++++++++----------------------------------
1 file changed, 11 insertions(+), 42 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0226e427..d0d4d0a5 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -240,7 +240,7 @@ sub recompute_online_node_usage {
my $haenv = $self->{haenv};
- my $online_nodes = $self->{ns}->list_online_nodes();
+ my $online_nodes = { map { $_ => 1 } $self->{ns}->list_online_nodes()->@* };
my $online_node_usage;
@@ -248,7 +248,7 @@ sub recompute_online_node_usage {
if ($mode eq 'static') {
$online_node_usage = eval {
my $scheduler = PVE::HA::Usage::Static->new($haenv);
- $scheduler->add_node($_) for $online_nodes->@*;
+ $scheduler->add_node($_) for keys $online_nodes->%*;
$haenv->update_static_service_stats();
return $scheduler;
};
@@ -266,49 +266,18 @@ sub recompute_online_node_usage {
# fallback to the basic algorithm in any case
if (!$online_node_usage) {
$online_node_usage = PVE::HA::Usage::Basic->new($haenv);
- $online_node_usage->add_node($_) for $online_nodes->@*;
+ $online_node_usage->add_node($_) for keys $online_nodes->%*;
}
- foreach my $sid (sort keys %{ $self->{ss} }) {
+ for my $sid (sort keys $self->{ss}->%*) {
my $sd = $self->{ss}->{$sid};
- my $state = $sd->{state};
- my $target = $sd->{target}; # optional
- if ($online_node_usage->contains_node($sd->{node})) {
- if (
- $state eq 'started'
- || $state eq 'request_stop'
- || $state eq 'fence'
- || $state eq 'freeze'
- || $state eq 'error'
- || $state eq 'recovery'
- ) {
- $online_node_usage->add_service_usage_to_node($sd->{node}, $sid, $sd->{node});
- } elsif (
- $state eq 'migrate'
- || $state eq 'relocate'
- || $state eq 'request_start_balance'
- ) {
- my $source = $sd->{node};
- # count it for both, source and target as load is put on both
- if ($state ne 'request_start_balance') {
- $online_node_usage->add_service_usage_to_node($source, $sid, $source, $target);
- }
- if ($online_node_usage->contains_node($target)) {
- $online_node_usage->add_service_usage_to_node($target, $sid, $source, $target);
- }
- } elsif ($state eq 'stopped' || $state eq 'request_start') {
- # do nothing
- } else {
- die "should not be reached (sid = '$sid', state = '$state')";
- }
- } elsif (defined($target) && $online_node_usage->contains_node($target)) {
- if ($state eq 'migrate' || $state eq 'relocate') {
- # to correctly track maintenance modi and also consider the target as used for the
- # case a node dies, as we cannot really know if the to-be-aborted incoming migration
- # has already cleaned up all used resources
- $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $target);
- }
- }
+ my $used_nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
+ my ($current, $target) = $used_nodes->@{qw(current target)};
+
+ $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
+ if $current;
+ $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
+ if $target;
}
$self->{online_node_usage} = $online_node_usage;
--
2.47.3
* [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations
From: Daniel Kral @ 2025-09-30 14:19 UTC
To: pve-devel
This makes use of the new signature of add_service_usage_to_node(...)
from the PVE::RS::ResourceScheduling::Static package, which allows
tracking which HA resources have been assigned to which nodes.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
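A hedged sketch of the granular Usage-level API after this patch (using
the Basic implementation; $haenv is assumed to be an initialized HA
environment handle and is only needed for warning logs here):

    use strict;
    use warnings;
    use PVE::HA::Usage::Basic;

    my $haenv;    # assumption: a real PVE::HA::Env instance in practice

    my $usage = PVE::HA::Usage::Basic->new($haenv);
    $usage->add_node('node1');

    # usage entries are now keyed by SID per node ...
    $usage->add_service_usage_to_node('node1', 'vm:100', 'node1');

    # ... so a single resource's usage can be dropped from all nodes
    # without rebuilding the whole object
    $usage->remove_service_usage('vm:100');
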
src/PVE/HA/Usage.pm | 6 ++++++
src/PVE/HA/Usage/Basic.pm | 14 +++++++++++---
src/PVE/HA/Usage/Static.pm | 26 ++++++++++++++++++++------
3 files changed, 37 insertions(+), 9 deletions(-)
diff --git a/src/PVE/HA/Usage.pm b/src/PVE/HA/Usage.pm
index 66d9572e..f19fae76 100644
--- a/src/PVE/HA/Usage.pm
+++ b/src/PVE/HA/Usage.pm
@@ -40,6 +40,12 @@ sub add_service_usage_to_node {
die "implement in subclass";
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ die "implement in subclass";
+}
+
# Returns a hash with $nodename => $score pairs. A lower $score is better.
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
diff --git a/src/PVE/HA/Usage/Basic.pm b/src/PVE/HA/Usage/Basic.pm
index ead08c54..9b80661a 100644
--- a/src/PVE/HA/Usage/Basic.pm
+++ b/src/PVE/HA/Usage/Basic.pm
@@ -17,7 +17,7 @@ sub new {
sub add_node {
my ($self, $nodename) = @_;
- $self->{nodes}->{$nodename} = 0;
+ $self->{nodes}->{$nodename} = {};
}
sub remove_node {
@@ -42,7 +42,7 @@ sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
if ($self->contains_node($nodename)) {
- $self->{nodes}->{$nodename}++;
+ $self->{nodes}->{$nodename}->{$sid} = 1;
} else {
$self->{haenv}->log(
'warning',
@@ -51,10 +51,18 @@ sub add_service_usage_to_node {
}
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ for my $node ($self->list_nodes()) {
+ delete $self->{nodes}->{$node}->{$sid};
+ }
+}
+
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
- return $self->{nodes};
+ return { map { $_ => scalar keys $self->{nodes}->{$_}->%* } keys $self->{nodes}->%* };
}
1;
diff --git a/src/PVE/HA/Usage/Static.pm b/src/PVE/HA/Usage/Static.pm
index 061e74a2..2867d4ff 100644
--- a/src/PVE/HA/Usage/Static.pm
+++ b/src/PVE/HA/Usage/Static.pm
@@ -22,14 +22,14 @@ sub new {
'service-stats' => {},
haenv => $haenv,
scheduler => $scheduler,
- 'service-counts' => {}, # Service count on each node. Fallback if scoring calculation fails.
+ 'node-services' => {}, # Services on each node. Fallback if scoring calculation fails.
}, $class;
}
sub add_node {
my ($self, $nodename) = @_;
- $self->{'service-counts'}->{$nodename} = 0;
+ $self->{'node-services'}->{$nodename} = {};
my $stats = $self->{'node-stats'}->{$nodename}
or die "did not get static node usage information for '$nodename'\n";
@@ -43,7 +43,7 @@ sub add_node {
sub remove_node {
my ($self, $nodename) = @_;
- delete $self->{'service-counts'}->{$nodename};
+ delete $self->{'node-services'}->{$nodename};
$self->{scheduler}->remove_node($nodename);
}
@@ -89,16 +89,27 @@ my sub get_service_usage {
sub add_service_usage_to_node {
my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
- $self->{'service-counts'}->{$nodename}++;
+ $self->{'node-services'}->{$nodename}->{$sid} = 1;
eval {
my $service_usage = get_service_usage($self, $sid, $service_node, $migration_target);
- $self->{scheduler}->add_service_usage_to_node($nodename, $service_usage);
+ $self->{scheduler}->add_service_usage_to_node($nodename, $sid, $service_usage);
};
$self->{haenv}->log('warning', "unable to add service '$sid' usage to node '$nodename' - $@")
if $@;
}
+sub remove_service_usage {
+ my ($self, $sid) = @_;
+
+ delete $self->{'node-services'}->{$_}->{$sid} for $self->list_nodes();
+
+ eval { $self->{scheduler}->remove_service_usage($sid) };
+ $self->{haenv}->log('warning', "unable to remove service '$sid' usage - $@") if $@;
+
+ delete $self->{'service-stats'}->{$sid}; # Invalidate old service stats
+}
+
sub score_nodes_to_start_service {
my ($self, $sid, $service_node) = @_;
@@ -111,7 +122,10 @@ sub score_nodes_to_start_service {
'err',
"unable to score nodes according to static usage for service '$sid' - $err",
);
- return $self->{'service-counts'};
+ return {
+ map { $_ => scalar keys $self->{'node-services'}->{$_}->%* }
+ keys $self->{'node-services'}->%*
+ };
}
# Take minus the value, so that a lower score is better, which our caller(s) expect(s).
--
2.47.3
* [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (9 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-10-17 12:32 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
2025-10-20 16:50 ` [pve-devel] superseded: [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
12 siblings, 1 reply; 36+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
The HA Manager builds $online_node_usage in every FSM iteration in
manage(...) and at every HA resource state change in
change_service_state(...). This becomes quite costly with a high HA
resource count and a lot of state changes happening at once, e.g.
starting up multiple nodes with rebalance_on_request_start set or a
failover of a node with many configured HA resources.
To improve this situation, make the changes to $online_node_usage more
granular: build $online_node_usage only once per call to manage(...) and
update the nodes a HA resource uses individually on every HA resource
state transition.
The change in service usage "freshness" should be negligible here as the
static service usage data is cached anyway (except if the cache fails
for some reason).
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
The add_service_usage(...) helper is added in anticipation of the next
patch; we don't need a helper if we don't go for #9.
src/PVE/HA/Manager.pm | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index d0d4d0a5..253deba9 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -77,6 +77,21 @@ sub new {
return $self;
}
+sub add_service_usage {
+ my ($self, $sid, $sd) = @_;
+
+ my $online_node_usage = $self->{online_node_usage};
+
+ my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
+ my $nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
+
+ my ($current, $target) = $nodes->@{qw(current target)};
+ $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
+ if $current;
+ $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
+ if $target;
+}
+
sub update_crs_scheduler_mode {
my ($self) = @_;
@@ -314,7 +329,8 @@ my $change_service_state = sub {
$sd->{$k} = $v;
}
- $self->recompute_online_node_usage();
+ $self->{online_node_usage}->remove_service_usage($sid);
+ $self->add_service_usage($sid, $sd);
$sd->{uid} = compute_new_uuid($new_state);
@@ -706,6 +722,8 @@ sub manage {
delete $ss->{$sid};
}
+ $self->recompute_online_node_usage();
+
my $new_rules = $haenv->read_rules_config();
# TODO PVE 10: Remove group migration when HA groups have been fully migrated to rules
@@ -735,8 +753,6 @@ sub manage {
for (;;) {
my $repeat = 0;
- $self->recompute_online_node_usage();
-
foreach my $sid (sort keys %$ss) {
my $sd = $ss->{$sid};
my $cd = $sc->{$sid} || { state => 'disabled' };
--
2.47.3
* [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (10 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular Daniel Kral
@ 2025-09-30 14:19 ` Daniel Kral
2025-10-17 12:42 ` Fiona Ebner
2025-10-20 16:50 ` [pve-devel] superseded: [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
12 siblings, 1 reply; 36+ messages in thread
From: Daniel Kral @ 2025-09-30 14:19 UTC (permalink / raw)
To: pve-devel
$online_node_usage is built on every call to manage(...) now, but this
can be reduced to building it only when the scheduler mode changes
(including on initialization and on the error path, for completeness).
This allows recompute_online_node_usage(...) to be reduced to
adding/removing nodes whenever they come online or go offline, while
service usage updates are handled whenever a HA resource changes state.
Therefore, recompute_online_node_usage(...) needs to be called only once
in manage(...), after $ns has been properly updated.
Note that with this change the ha-manager no longer acknowledges hotplug
changes to the guest configs as long as the HA resource state doesn't
change.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
If we go for this patch, then we would need some mechanism to update the
static usage for a single HA resource or for all HA resources registered
in $online_node_usage at once (or just rebuild $online_node_usage at that
point..).
src/PVE/HA/Manager.pm | 90 +++++++++++++++++++++++--------------------
1 file changed, 49 insertions(+), 41 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 253deba9..6fadb3f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -106,6 +106,7 @@ sub update_crs_scheduler_mode {
if (!defined($old_mode)) {
$haenv->log('info', "using scheduler mode '$new_mode'") if $new_mode ne 'basic';
} elsif ($new_mode eq $old_mode) {
+ $haenv->update_static_service_stats() if $old_mode eq 'static';
return; # nothing to do
} else {
$haenv->log('info', "switching scheduler mode from '$old_mode' to '$new_mode'");
@@ -113,6 +114,39 @@ sub update_crs_scheduler_mode {
$self->{crs}->{scheduler} = $new_mode;
+ my $online_node_usage;
+
+ if ($new_mode eq 'static') {
+ $online_node_usage = eval {
+ my $scheduler = PVE::HA::Usage::Static->new($haenv);
+ $scheduler->add_node($_) for $self->{ns}->list_online_nodes()->@*;
+ $haenv->update_static_service_stats();
+ return $scheduler;
+ };
+ if ($@) {
+ $self->{crs}->{scheduler} = 'basic'; # retry on next update
+ $haenv->log(
+ 'warning',
+ "fallback to 'basic' scheduler mode, init for 'static' failed - $@",
+ );
+ }
+ } elsif ($new_mode eq 'basic') {
+ # handled below in the general fall-back case
+ } else {
+ $haenv->log('warning', "got unknown scheduler mode '$new_mode', using 'basic'");
+ }
+
+ # fallback to the basic algorithm in any case
+ if (!$online_node_usage) {
+ $online_node_usage = PVE::HA::Usage::Basic->new($haenv);
+ $online_node_usage->add_node($_) for $self->{ns}->list_online_nodes()->@*;
+ }
+
+ $self->{online_node_usage} = $online_node_usage;
+
+ # initialize with current nodes and services states
+ $self->add_service_usage($_, $self->{ss}->{$_}) for keys $self->{ss}->%*;
+
return;
}
@@ -253,49 +287,19 @@ my $valid_service_states = {
sub recompute_online_node_usage {
my ($self) = @_;
- my $haenv = $self->{haenv};
+ my ($haenv, $ns) = $self->@{qw(haenv ns)};
- my $online_nodes = { map { $_ => 1 } $self->{ns}->list_online_nodes()->@* };
+ for my $node ($self->{online_node_usage}->list_nodes()) {
+ next if $ns->node_is_online($node);
- my $online_node_usage;
-
- if (my $mode = $self->{crs}->{scheduler}) {
- if ($mode eq 'static') {
- $online_node_usage = eval {
- my $scheduler = PVE::HA::Usage::Static->new($haenv);
- $scheduler->add_node($_) for keys $online_nodes->%*;
- $haenv->update_static_service_stats();
- return $scheduler;
- };
- $haenv->log(
- 'warning',
- "fallback to 'basic' scheduler mode, init for 'static' failed - $@",
- ) if $@;
- } elsif ($mode eq 'basic') {
- # handled below in the general fall-back case
- } else {
- $haenv->log('warning', "got unknown scheduler mode '$mode', using 'basic'");
- }
+ $self->{online_node_usage}->remove_node($node);
}
- # fallback to the basic algorithm in any case
- if (!$online_node_usage) {
- $online_node_usage = PVE::HA::Usage::Basic->new($haenv);
- $online_node_usage->add_node($_) for keys $online_nodes->%*;
+ for my $node ($ns->list_online_nodes()->@*) {
+ next if $self->{online_node_usage}->contains_node($node);
+
+ $self->{online_node_usage}->add_node($node);
}
-
- for my $sid (sort keys $self->{ss}->%*) {
- my $sd = $self->{ss}->{$sid};
- my $used_nodes = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);
- my ($current, $target) = $used_nodes->@{qw(current target)};
-
- $online_node_usage->add_service_usage_to_node($current, $sid, $sd->{node}, $sd->{target})
- if $current;
- $online_node_usage->add_service_usage_to_node($target, $sid, $sd->{node}, $sd->{target})
- if $target;
- }
-
- $self->{online_node_usage} = $online_node_usage;
}
my $change_service_state = sub {
@@ -693,6 +697,8 @@ sub manage {
$self->{groups} = $haenv->read_group_config(); # update
+ $self->recompute_online_node_usage();
+
# compute new service status
# add new service
@@ -704,11 +710,13 @@ sub manage {
$haenv->log('info', "adding new service '$sid' on node '$cd->{node}'");
# assume we are running to avoid relocate running service at add
my $state = ($cd->{state} eq 'started') ? 'request_start' : 'request_stop';
- $ss->{$sid} = {
+ my $sd = $ss->{$sid} = {
state => $state,
node => $cd->{node},
uid => compute_new_uuid('started'),
};
+
+ $self->add_service_usage($sid, $sd);
}
# remove stale or ignored services from manager state
@@ -718,12 +726,12 @@ sub manage {
my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
$haenv->log('info', "removing stale service '$sid' ($reason)");
+ $self->{online_node_usage}->remove_service_usage($sid);
+
# remove all service related state information
delete $ss->{$sid};
}
- $self->recompute_online_node_usage();
-
my $new_rules = $haenv->read_rules_config();
# TODO PVE 10: Remove group migration when HA groups have been fully migrated to rules
--
2.47.3
* Re: [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
@ 2025-10-15 14:31 ` Fiona Ebner
2025-10-16 9:07 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-15 14:31 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> get_derived_property(...) is called in the semi-hot path of the HA
> Manager's static load scheduler to retrieve the static stats of each VM.
> As the defaults are only needed in certain cases and for a very small
> subset of properties in the VM config, get those separately when needed.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> get_current_memory(...) is still quite costly here, because it calls
> parse_memory(...), which calls
> PVE::JSONSchema::parse_property_string(...), which adds up for many
> guest configurations parsed in every manage(...) call, but this already
> helps quite a lot.
If this really is a problem, we could do our own parsing, i.e. returning
the value if the property string starts with \d+ or current=\d+ and
falling back to get_current_memory() if it doesn't. Of course also using
the default from $memory_fmt if not set.
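Something like the following minimal sketch, purely to illustrate the
idea (the helper name and the 512 default are assumptions here; the real
default would come from $memory_fmt and the existing parser stays the
fallback):
```
# hypothetical fast path for the common forms of the memory property
sub get_max_memory_fast {
    my ($memory_value) = @_;
    return 512 if !defined($memory_value); # assumed default from $memory_fmt
    return $1 if $memory_value =~ m/^(\d+)(?:,|$)/;          # e.g. "2048"
    return $1 if $memory_value =~ m/^current=(\d+)(?:,|$)/;  # e.g. "current=2048"
    # anything more complex still goes through the full parser
    return PVE::QemuServer::Memory::get_current_memory($memory_value);
}
```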
>
> src/PVE/QemuConfig.pm | 8 +++-----
> src/PVE/QemuServer.pm | 6 ++++++
> 2 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
> index d0844c4c..078c87e0 100644
> --- a/src/PVE/QemuConfig.pm
> +++ b/src/PVE/QemuConfig.pm
> @@ -582,12 +582,10 @@ sub load_current_config {
We could go a step further and save the three defaults we are interested
in during module load into variables. Then you also save the hash
accesses into $confdesc and $memory_fmt.
> sub get_derived_property {
> my ($class, $conf, $name) = @_;
>
> - my $defaults = PVE::QemuServer::load_defaults();
> -
> if ($name eq 'max-cpu') {
> - my $cpus =
> - ($conf->{sockets} || $defaults->{sockets}) * ($conf->{cores} || $defaults->{cores});
> - return $conf->{vcpus} || $cpus;
> + my $sockets = $conf->{sockets} || PVE::QemuServer::get_default_property_value('sockets');
> + my $cores = $conf->{cores} || PVE::QemuServer::get_default_property_value('cores');
> + return $conf->{vcpus} || ($sockets * $cores);
> } elsif ($name eq 'max-memory') { # current usage maximum, not maximum hotpluggable
> return get_current_memory($conf->{memory}) * 1024 * 1024;
> } else {
Question is, how much do we really need to optimize the function here?
I'm not against it, but just want to note that looking at the static
usage will always have its limitations in practice (independent of
performance). With PSI-based usage, we should only need the static
information for to-be-started or to-be-recovered guests rather than all,
so performance of get_derived_property() becomes much less relevant. Or
what do you think?
* Re: [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper
2025-10-15 14:31 ` Fiona Ebner
@ 2025-10-16 9:07 ` Daniel Kral
0 siblings, 0 replies; 36+ messages in thread
From: Daniel Kral @ 2025-10-16 9:07 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Wed Oct 15, 2025 at 4:31 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:20 PM, Daniel Kral wrote:
>> get_derived_property(...) is called in the semi-hot path of the HA
>> Manager's static load scheduler to retrieve the static stats of each VM.
>> As the defaults are only needed in certain cases and for a very small
>> subset of properties in the VM config, get those separately when needed.
>>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>> get_current_memory(...) is still quite costly here, because it calls
>> parse_memory(...), which calls
>> PVE::JSONSchema::parse_property_string(...), which adds up for many
>> guest configurations parsed in every manage(...) call, but this already
>> helps quite a lot.
>
> If this really is a problem, we could do our own parsing, i.e. returning
> the value if the property string starts with \d+ or current=\d+ and
> falling back to get_current_memory() if it doesn't. Of course also using
> the default from $memory_fmt if not set.
>
>>
>> src/PVE/QemuConfig.pm | 8 +++-----
>> src/PVE/QemuServer.pm | 6 ++++++
>> 2 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
>> index d0844c4c..078c87e0 100644
>> --- a/src/PVE/QemuConfig.pm
>> +++ b/src/PVE/QemuConfig.pm
>> @@ -582,12 +582,10 @@ sub load_current_config {
>
> We could go a step further and save the three defaults we are interested
> in during module load into variables. Then you also save the hash
> accesses into $confdesc and $memory_fmt.
>
>> sub get_derived_property {
>> my ($class, $conf, $name) = @_;
>>
>> - my $defaults = PVE::QemuServer::load_defaults();
>> -
>> if ($name eq 'max-cpu') {
>> - my $cpus =
>> - ($conf->{sockets} || $defaults->{sockets}) * ($conf->{cores} || $defaults->{cores});
>> - return $conf->{vcpus} || $cpus;
>> + my $sockets = $conf->{sockets} || PVE::QemuServer::get_default_property_value('sockets');
>> + my $cores = $conf->{cores} || PVE::QemuServer::get_default_property_value('cores');
>> + return $conf->{vcpus} || ($sockets * $cores);
>> } elsif ($name eq 'max-memory') { # current usage maximum, not maximum hotpluggable
>> return get_current_memory($conf->{memory}) * 1024 * 1024;
>> } else {
>
> Question is, how much do we really need to optimize the function here?
> I'm not against it, but just want to note that looking at the static
> usage will always have its limitations in practice (independent of
> performance). With PSI-based usage, we should only need the static
> information for to-be-started or to-be-recovered guests rather than all,
> so performance of get_derived_property() becomes much less relevant. Or
> what do you think?
That's a good point, in that regard I definitely favor maintainability
over minor performance improvements here. The (one-time) profile of
9,000 HA resources being balanced on start with all patches applied
(except ha #9) shows the following runtimes:
+---------------------------------------------+------------+------------+
| Function | Excl. time | Incl. time |
+---------------------------------------------+------------+------------+
| PVE::HA::Resources::PVEVM::get_static_stats | 55 ms | 437 ms |
| PVE::QemuConfig::get_derived_property | 40 ms | 363 ms |
| PVE::QemuServer::Memory::get_current_memory | 23 ms | 340 ms |
| PVE::QemuServer::Memory::parse_memory | 22 ms | 317 ms |
| PVE::JSONSchema::parse_property_string | 94 ms | 223 ms |
+---------------------------------------------+------------+------------+
parse_property_string does show up as the function with the 6th highest
exclusive runtime, but my main objective was that `manage(...)` should
finish in under ~10 seconds for ~10,000 HA resources. That should be
doable now, especially as the number of calls to get_static_usage(...)
is now constant per manage(...) call instead of proportional to the
number of HA resource state changes.
So let's stay with the current patch unless people actually run into
this..
By the way, the highest-exclusive-runtime function in that profile is
Sys::Syslog::syslog now (excluding CORE::sleep of course) ;). But since
that should happen only in very special cases, I don't think that's an
actual performance problem either.
* Re: [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes
2025-09-30 14:19 ` [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes Daniel Kral
@ 2025-10-16 10:32 ` Fiona Ebner
2025-10-16 15:34 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-16 10:32 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:21 PM, Daniel Kral wrote:
> Implements a simple bidirectional map to track which service usages have
> been added to nodes, so that these can be removed later individually.
>
> The StaticServiceUsage is added to the HashMap<> in StaticNodeInfo to
> reduce unnecessary indirection when summing these values in
> score_nodes_to_start_service(...).
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> I started out adding and removing the service_usage in StaticNodeUsage
> on each add_service_usage_to_node(...) and remove_service_usage(...)
> call individually, but I think summing those up every time is better for
> numerical stability..
This can go into the commit message too if worded a bit less informally ;)
>
> .../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
> pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
> 2 files changed, 160 insertions(+), 22 deletions(-)
>
> diff --git a/pve-rs/src/bindings/resource_scheduling_static.rs b/pve-rs/src/bindings/resource_scheduling_static.rs
> index 7cf2f35..535dfe3 100644
> --- a/pve-rs/src/bindings/resource_scheduling_static.rs
> +++ b/pve-rs/src/bindings/resource_scheduling_static.rs
> @@ -16,8 +16,16 @@ pub mod pve_rs_resource_scheduling_static {
>
> perlmod::declare_magic!(Box<Scheduler> : &Scheduler as "PVE::RS::ResourceScheduling::Static");
>
> + struct StaticNodeInfo {
> + name: String,
> + maxcpu: usize,
> + maxmem: usize,
> + services: HashMap<String, StaticServiceUsage>,
> + }
> +
> struct Usage {
> - nodes: HashMap<String, StaticNodeUsage>,
> + nodes: HashMap<String, StaticNodeInfo>,
> + service_nodes: HashMap<String, Vec<String>>,
Rather than Vec it could be a HashSet, but I'm fine with either way.
> }
>
> /// A scheduler instance contains the resource usage by node.
---snip---
> @@ -67,10 +75,25 @@ pub mod pve_rs_resource_scheduling_static {
>
> /// Method: Remove a node from the scheduler.
> #[export]
> - pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) {
> + pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) -> Result<(), Error> {
> let mut usage = this.inner.lock().unwrap();
>
> - usage.nodes.remove(nodename);
> + if let Some(node) = usage.nodes.remove(nodename) {
> + for (sid, _) in node.services.iter() {
> + match usage.service_nodes.get_mut(sid) {
> + Some(nodes) => {
> + nodes.retain(|node| !node.eq(nodename));
> + }
> + None => bail!(
> + "service '{}' not present while removing node '{}'",
Nit: for a user (or developer in 6 months time :P), this error message
is rather unspecific. I'd suggest:
"internal error: ... not present in service_nodes hash map while ..."
> + sid,
> + nodename
> + ),
> + }
> + }
> + }
> +
> + Ok(())
> }
>
> /// Method: Get a list of all the nodes in the scheduler.
> @@ -93,22 +116,55 @@ pub mod pve_rs_resource_scheduling_static {
> usage.nodes.contains_key(nodename)
> }
>
> - /// Method: Add usage of `service` to the node's usage.
> + /// Method: Add service `sid` and its `service_usage` to the node.
> #[export]
> pub fn add_service_usage_to_node(
> #[try_from_ref] this: &Scheduler,
> nodename: &str,
> - service: StaticServiceUsage,
> + sid: &str,
> + service_usage: StaticServiceUsage,
So this requires a versioned Breaks in d/control. Please mention this in
the commit notes and in the cover letter.
> ) -> Result<(), Error> {
> let mut usage = this.inner.lock().unwrap();
>
> match usage.nodes.get_mut(nodename) {
> Some(node) => {
> - node.add_service_usage(&service);
> - Ok(())
> + if node.services.contains_key(sid) {
> + bail!("service '{}' already added to node '{}'", sid, nodename);
> + }
> +
> + node.services.insert(sid.to_string(), service_usage);
> }
> None => bail!("node '{}' not present in usage hashmap", nodename),
> }
> +
> + if let Some(nodes) = usage.service_nodes.get_mut(sid) {
> + nodes.push(nodename.to_string());
> + } else {
> + usage
> + .service_nodes
> + .insert(sid.to_string(), vec![nodename.to_string()]);
> + }
Nit: just to be sure, we could check whether the node would be duplicate
in service_nodes too (is more straightforward with using HashSet). Maybe
not really worth it though, because if we keep the mappings node ->
services and service -> nodes consistent with each other (which we of
course should) then the check above already protects us.
---snip---
> @@ -50,7 +65,54 @@ sub test_balance {
> is($nodes[1], "A", 'second should be A');
> }
>
> - $static->add_service_usage_to_node($nodes[0], $service);
> + $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
Nit: could also just use something like "vm:$i" or "vm:" . 100 + $i
rather than letters. If it's a variable-controlled enumeration that
seems more natural to me.
> + }
> +}
> +
---snip---
> @@ -96,9 +158,9 @@ sub test_balance_small_memory_difference {
> $static->add_node("C", 4, 8_000_000_000);
>
> if ($with_start_load) {
> - $static->add_service_usage_to_node("A", { maxcpu => 4, maxmem => 1_000_000_000 });
> - $static->add_service_usage_to_node("B", { maxcpu => 2, maxmem => 1_000_000_000 });
> - $static->add_service_usage_to_node("C", { maxcpu => 2, maxmem => 1_000_000_000 });
> + $static->add_service_usage_to_node("A", "a", { maxcpu => 4, maxmem => 1_000_000_000 });
> + $static->add_service_usage_to_node("B", "b", { maxcpu => 2, maxmem => 1_000_000_000 });
> + $static->add_service_usage_to_node("C", "c", { maxcpu => 2, maxmem => 1_000_000_000 });
> }
>
> my $service = {
> @@ -106,6 +168,7 @@ sub test_balance_small_memory_difference {
> maxmem => 16_000_000,
> };
>
> + my @sid_names = ("d" .. "z");
> for (my $i = 0; $i < 20; $i++) {
> my $score_list = $static->score_nodes_to_start_service($service);
>
> @@ -131,12 +194,13 @@ sub test_balance_small_memory_difference {
> die "internal error, got $i % 4 == " . ($i % 4) . "\n";
> }
>
> - $static->add_service_usage_to_node($nodes[0], $service);
> + $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
Nit: same as above regarding service names, we're already very close to
the edge of the alphabet in this case ;P
> }
> }
>
> test_basic();
> test_balance();
> +test_balance_removal();
> test_overcommitted();
> test_balance_small_memory_difference(1);
> test_balance_small_memory_difference(0);
* Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache Daniel Kral
@ 2025-10-16 11:12 ` Fiona Ebner
2025-10-16 15:15 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-16 11:12 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:21 PM, Daniel Kral wrote:
> As the HA Manager builds the static load scheduler, it queries the
> services' static usage by reading and parsing the static guest configs
> individually, which can take significant time with respect to how many
> times recompute_online_node_usage(...) is called.
>
> PVE::Cluster exposes an efficient interface to gather a set of
> properties from one or all guest configs [0]. This is used here to build
> a rather short-lived cache on every (re)initialization of the static
> load scheduler.
>
> The downside to this approach is if there are way more non-HA managed
> guests in the cluster than HA-managed guests, which causes this to load
> much more information than necessary. It also doesn't cache the default
> values for the environment-specific static service usage, which causes
> quite a bottleneck as well.
>
> [0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
>
> Suggested-by: Fiona Ebner <f.ebner@proxmox.com>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> If the above mentioned downside is too large for some setups, we could
> also just add a switch to enable/disable the static stats cache or
> automatically figure out if it brings any benefits.. But I think this
> will be good enough, especially with the latter patches making way fewer
> calls to get_service_usage(...).
Parsing the configs is much more costly than
PVE::Cluster::get_guest_config_properties(), so improving the situation
for the majority of use cases of the static scheduler still makes sense,
even if it hurts performance for some more exotic setups. And we would
like to consider usage of non-HA guests in the future too to make the
static scheduler more accurate even in such exotic setups. But with
PSI-based information that is already included in the node usage, no
extra preparation necessary there.
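For reference, the bulk query roughly looks like this (a sketch; the
exact property list and how the patch stores the result are assumptions
here):
```
# one IPC call returns the requested properties for all guest configs at
# once, keyed by VMID, instead of reading and parsing each config file
my $props = PVE::Cluster::get_guest_config_properties(['cores', 'sockets', 'memory']);
# e.g. $props->{100}->{memory}; guests with none of the requested
# properties set do not show up in the result at all
my $static_service_stats = $props; # cached per scheduler (re)initialization
```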
>
> src/PVE/HA/Env.pm | 12 ++++++++++++
> src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
> src/PVE/HA/Manager.pm | 1 +
Needs a rebase because of changes in Manager.pm
> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
> };
> }
>
> +sub get_static_service_stats {
> + my ($self, $id) = @_;
> +
> + # undef if update_static_service_stats(...) failed before
> + return undef if !defined($self->{static_service_stats});
> +
> + return $self->{static_service_stats}->{$id} // {};
Can't returning '{}' when nothing is there lead to issues down the line?
If we return undef instead, it's consistent with not having anything
cached and the caller will fall back to loading the config.
> @@ -477,6 +467,8 @@ sub new {
>
> $self->{service_config} = $self->read_service_config();
>
> + $self->{static_service_stats} = undef;
> +
> return $self;
> }
>
> @@ -943,6 +935,25 @@ sub watchdog_update {
> return &$modify_watchog($self, $code);
> }
>
> +sub get_static_service_stats {
> + my ($self, $id) = @_;
> +
> + # undef if update_static_service_stats(...) failed before
> + return undef if !defined($self->{static_service_stats});
> +
> + return $self->{static_service_stats}->{$id} // {};
Same here.
> +}
> +
> +sub update_static_service_stats {
> + my ($self) = @_;
> +
> + my $filename = "$self->{statusdir}/static_service_stats";
> + my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
> + $self->log('warning', "unable to update static service stats cache - $@") if $@;
> +
> + $self->{static_service_stats} = $stats;
> +}
> +
> sub get_static_node_stats {
> my ($self) = @_;
>
> diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
> index 72623ee1..7641b1a9 100644
> --- a/src/PVE/HA/Sim/Resources.pm
> +++ b/src/PVE/HA/Sim/Resources.pm
> @@ -143,8 +143,8 @@ sub get_static_stats {
> my $sid = $class->type() . ":$id";
> my $hardware = $haenv->hardware();
>
> - my $stats = $hardware->read_static_service_stats();
> - if (my $service_stats = $stats->{$sid}) {
> + my $service_stats = $hardware->get_static_service_stats($sid);
> + if (%$service_stats) {
Style nit: with the above change, this could be just $service_stats or a
definedness check for it. If really checking if there are any keys, I
prefer being explicit with scalar(keys ...)
> return $service_stats;
> } elsif ($id =~ /^(\d)(\d\d)/) {
> # auto assign usage calculated from ID for convenience
* Re: [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery Daniel Kral
@ 2025-10-16 11:25 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-16 11:25 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> recompute_online_node_usage(...) is currently only dependent on the
> configured CRS scheduler mode, the list of online nodes and the HA
> resource's state, current node and optional migration target.
>
> The proper recovery of HA resources was introduced in 9da84a0d ("fix
> 'change_service_location' misuse and recovery from fencing") as a
> private helper to recover fenced HA resources and the
> recompute_online_node_usage(...) was needed here.
>
> As the recovery of fenced HA resources is its own state for HA resources
> since c259b1a8 ("manager: make recovery actual state in FSM"), the
> change_service_state(...) to 'recovery' will already call
> recompute_online_node_usage(...) before, making this call redundant.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> src/PVE/HA/Manager.pm | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 3f81f233..f69a7fe9 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -1301,8 +1301,6 @@ sub next_state_recovery {
>
> my $fenced_node = $sd->{node}; # for logging purpose
>
> - $self->recompute_online_node_usage(); # we want the most current node state
> -
> my $recovery_node = select_service_node(
> $self->{rules},
> $self->{online_node_usage},
* Re: [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node from next_state_recovery
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node " Daniel Kral
@ 2025-10-16 11:33 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-16 11:33 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> Since 5c2eef4b ("account service to source and target during move") a
> moving HA resource's load is accounted for on the source and target
> nodes.
>
> A HA resource is in the 'recovery' state after a successfully fenced
> node and its primary goal becomes to find a recovery node. As soon as a
> recovery node is found, the HA resource is moved there. As the previous
> node was fenced, there is only load on the recovery node.
>
> The add_service_usage_to_node(...) is redundant at this point as the
> subsequent call to change_service_state(...) to either the 'started' or
> 'request_stop' state will call recompute_online_node_usage(...) and make
> the changes to $online_node_usage be discarded immediately.
This was done with the plan to make it more granular in mind, explicitly
not relying on a particular recompute_online_node_usage() call. I guess
the necessary calculation will still be covered by
change_service_state() at the end of the series? If yes, please mention
that in the commit message here and take a
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> src/PVE/HA/Manager.pm | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index f69a7fe9..6afc9250 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -1321,8 +1321,6 @@ sub next_state_recovery {
> $fence_recovery_cleanup->($self, $sid, $fenced_node);
>
> $haenv->steal_service($sid, $sd->{node}, $recovery_node);
> - $self->{online_node_usage}->add_service_usage_to_node($recovery_node, $sid, $recovery_node);
> - $self->{online_node_usage}->add_service_node($sid, $recovery_node);
>
> # NOTE: $sd *is normally read-only*, fencing is the exception
> $cd->{node} = $sd->{node} = $recovery_node;
* Re: [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started Daniel Kral
@ 2025-10-16 11:39 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-16 11:39 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> Since 5c2eef4b ("account service to source and target during move") a
> moving HA resource's load is accounted for on the source and target
> nodes.
>
> A HA resource in the 'started' state, which is not configured otherwise
> or has no pending CRM commands left to process, actively checks whether
> there is "better" node placement by querying select_service_node(...).
> When a better node is found, the HA resource will be migrated or
> relocated to the found node depending on their type.
>
> The add_service_usage_to_node(...) is redundant at this point as the
> subsequent call to change_service_state(...) to either the 'migrate' or
> 'relocate' state will call recompute_online_node_usage(...) and make
> the changes to $online_node_usage be discarded immediately.
Same here: if this is still covered by change_service_state() with more
granular operations at the end of the series, please mention that here
and consider it:
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> src/PVE/HA/Manager.pm | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 6afc9250..468e41eb 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -1194,9 +1194,6 @@ sub next_state_started {
> );
>
> if ($node && ($sd->{node} ne $node)) {
> - $self->{online_node_usage}->add_service_usage_to_node($node, $sid, $sd->{node});
> - $self->{online_node_usage}->add_service_node($sid, $node);
> -
> if (defined(my $fallback = $sd->{maintenance_node})) {
> if ($node eq $fallback) {
> $haenv->log(
* Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-10-16 11:12 ` Fiona Ebner
@ 2025-10-16 15:15 ` Daniel Kral
2025-10-17 10:02 ` Fiona Ebner
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Kral @ 2025-10-16 15:15 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>> As the HA Manager builds the static load scheduler, it queries the
>> services' static usage by reading and parsing the static guest configs
>> individually, which can take significant time with respect to how many
>> times recompute_online_node_usage(...) is called.
>>
>> PVE::Cluster exposes an efficient interface to gather a set of
>> properties from one or all guest configs [0]. This is used here to build
>> a rather short-lived cache on every (re)initialization of the static
>> load scheduler.
>>
>> The downside to this approach is if there are way more non-HA managed
>> guests in the cluster than HA-managed guests, which causes this to load
>> much more information than necessary. It also doesn't cache the default
>> values for the environment-specific static service usage, which causes
>> quite a bottleneck as well.
>>
>> [0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
>>
>> Suggested-by: Fiona Ebner <f.ebner@proxmox.com>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>> If the above mentioned downside is too large for some setups, we could
>> also just add a switch to enable/disable the static stats cache or
>> automatically figure out if it brings any benefits.. But I think this
>> will be good enough, especially with the latter patches making way fewer
>> calls to get_service_usage(...).
>
> Parsing the configs is much more costly than
> PVE::Cluster::get_guest_config_properties(), so improving the situation
> for the majority of use cases of the static scheduler still makes sense,
> even if it hurts performance for some more exotic setups. And we would
> like to consider usage of non-HA guests in the future too to make the
> static scheduler more accurate even in such exotic setups. But with
> PSI-based information that is already included in the node usage, no
> extra preparation necessary there.
I fully agree, that's a rather odd case and in hindsight I doubt that
this has any significant performance improvements for the most common
setups.
Not that important, but I'd move this patch to after the granular
changes are implemented, since only then does it bring a performance
improvement; at this point in the series, with this implementation, it
actually degrades performance.
>
>>
>> src/PVE/HA/Env.pm | 12 ++++++++++++
>> src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
>> src/PVE/HA/Manager.pm | 1 +
>
> Needs a rebase because of changes in Manager.pm
>
>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>> };
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Can't returning '{}' when nothing is there lead to issues down the line?
> If we return undef instead, it's consistent with not having anything
> cached and the caller will fall back to loading the config.
This return value type definitely needs improvement and/or better
documentation, but an undef $self->{static_service_stats}->{$id} value
indicates that it should fallback to the default value as none of the
properties requested by get_guest_config_properties(...) was included in
that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
properties defined in a VM config.
When $self->{static_service_stats} itself is undef, then the static
cache couldn't be queried for some reason.
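A hypothetical caller-side sketch, just to spell out the three cases
described above:
```
my $cached = $haenv->get_static_service_stats($id);
if (!defined($cached)) {
    # whole cache unavailable (update_static_service_stats(...) failed):
    # fall back to reading and parsing the guest config directly
} elsif (!scalar(keys %$cached)) {
    # guest had none of the queried properties set: use the defaults
} else {
    # use the cached 'cores'/'sockets'/'memory' values
}
```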
>
>> @@ -477,6 +467,8 @@ sub new {
>>
>> $self->{service_config} = $self->read_service_config();
>>
>> + $self->{static_service_stats} = undef;
>> +
>> return $self;
>> }
>>
>> @@ -943,6 +935,25 @@ sub watchdog_update {
>> return &$modify_watchog($self, $code);
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Same here.
>
>> +}
>> +
>> +sub update_static_service_stats {
>> + my ($self) = @_;
>> +
>> + my $filename = "$self->{statusdir}/static_service_stats";
>> + my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
>> + $self->log('warning', "unable to update static service stats cache - $@") if $@;
>> +
>> + $self->{static_service_stats} = $stats;
>> +}
>> +
>> sub get_static_node_stats {
>> my ($self) = @_;
>>
>> diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
>> index 72623ee1..7641b1a9 100644
>> --- a/src/PVE/HA/Sim/Resources.pm
>> +++ b/src/PVE/HA/Sim/Resources.pm
>> @@ -143,8 +143,8 @@ sub get_static_stats {
>> my $sid = $class->type() . ":$id";
>> my $hardware = $haenv->hardware();
>>
>> - my $stats = $hardware->read_static_service_stats();
>> - if (my $service_stats = $stats->{$sid}) {
>> + my $service_stats = $hardware->get_static_service_stats($sid);
>> + if (%$service_stats) {
>
> Style nit: with the above change, this could be just $service_stats or a
> definedness check for it. If really checking if there are any keys, I
> prefer being explicit with scalar(keys ...)
ACK will do that, don't know why I didn't see that already :)
>
>> return $service_stats;
>> } elsif ($id =~ /^(\d)(\d\d)/) {
>> # auto assign usage calculated from ID for convenience
* Re: [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes
2025-10-16 10:32 ` Fiona Ebner
@ 2025-10-16 15:34 ` Daniel Kral
2025-10-17 10:55 ` Fiona Ebner
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Kral @ 2025-10-16 15:34 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Thu Oct 16, 2025 at 12:32 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>> Implements a simple bidirectional map to track which service usages have
>> been added to nodes, so that these can be removed later individually.
>>
>> The StaticServiceUsage is added to the HashMap<> in StaticNodeInfo to
>> reduce unnecessary indirection when summing these values in
>> score_nodes_to_start_service(...).
>>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>
> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
>
>> ---
>> I started out adding and removing the service_usage in StaticNodeUsage
>> on each add_service_usage_to_node(...) and remove_service_usage(...)
>> call individually, but I think summing those up every time is better for
>> numerical stability..
>
> This can go into the commit message too if worded a bit less informally ;)
Will do! Even though I asked myself if the concern about numerical
stability really holds true here if we don't go for ha #9 as then these
values will only change over a single manage(...) run and I'd guess that
most users would use floating-point values which can be represented
exactly (0.0, 1.0, 2.0, 3.0, ... and 0.5, 0.25, 0.125, ...). For these,
numerical stability shouldn't be a problem at all, or not?
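A tiny self-contained illustration of that point, with 0.1 as a stand-in
for a value without an exact binary representation (e.g. a fractional
container cpulimit):
```
#!/usr/bin/perl
use strict;
use warnings;
# 0.25 is exactly representable, so repeated summation stays exact;
# 0.1 is not, so the running total picks up a small rounding error.
for my $case ([0.25, 250], [0.1, 100]) {
    my ($usage, $expected) = @$case;
    my $total = 0;
    $total += $usage for 1 .. 1000;
    printf("usage %s: total %.17g (expected %s)\n", $usage, $total, $expected);
}
```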
>
>>
>> .../bindings/resource_scheduling_static.rs | 98 ++++++++++++++++---
>> pve-rs/test/resource_scheduling.pl | 84 ++++++++++++++--
>> 2 files changed, 160 insertions(+), 22 deletions(-)
>>
>> diff --git a/pve-rs/src/bindings/resource_scheduling_static.rs b/pve-rs/src/bindings/resource_scheduling_static.rs
>> index 7cf2f35..535dfe3 100644
>> --- a/pve-rs/src/bindings/resource_scheduling_static.rs
>> +++ b/pve-rs/src/bindings/resource_scheduling_static.rs
>> @@ -16,8 +16,16 @@ pub mod pve_rs_resource_scheduling_static {
>>
>> perlmod::declare_magic!(Box<Scheduler> : &Scheduler as "PVE::RS::ResourceScheduling::Static");
>>
>> + struct StaticNodeInfo {
>> + name: String,
>> + maxcpu: usize,
>> + maxmem: usize,
>> + services: HashMap<String, StaticServiceUsage>,
>> + }
>> +
>> struct Usage {
>> - nodes: HashMap<String, StaticNodeUsage>,
>> + nodes: HashMap<String, StaticNodeInfo>,
>> + service_nodes: HashMap<String, Vec<String>>,
>
> Rather than Vec it could be a HashSet, but I'm fine with either way.
>
>> }
>>
>> /// A scheduler instance contains the resource usage by node.
>
> ---snip---
>
>> @@ -67,10 +75,25 @@ pub mod pve_rs_resource_scheduling_static {
>>
>> /// Method: Remove a node from the scheduler.
>> #[export]
>> - pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) {
>> + pub fn remove_node(#[try_from_ref] this: &Scheduler, nodename: &str) -> Result<(), Error> {
>> let mut usage = this.inner.lock().unwrap();
>>
>> - usage.nodes.remove(nodename);
>> + if let Some(node) = usage.nodes.remove(nodename) {
>> + for (sid, _) in node.services.iter() {
>> + match usage.service_nodes.get_mut(sid) {
>> + Some(nodes) => {
>> + nodes.retain(|node| !node.eq(nodename));
>> + }
>> + None => bail!(
>> + "service '{}' not present while removing node '{}'",
>
> Nit: for a user (or developer in 6 months time :P), this error message
> is rather unspecific. I'd suggest:
> "internal error: ... not present in service_nodes hash map while ..."
Will do, as those are always helpful, but hopefully this won't reach any
users since it's a consistency check of the implementation itself ;)
>
>> + sid,
>> + nodename
>> + ),
>> + }
>> + }
>> + }
>> +
>> + Ok(())
>> }
>>
>> /// Method: Get a list of all the nodes in the scheduler.
>> @@ -93,22 +116,55 @@ pub mod pve_rs_resource_scheduling_static {
>> usage.nodes.contains_key(nodename)
>> }
>>
>> - /// Method: Add usage of `service` to the node's usage.
>> + /// Method: Add service `sid` and its `service_usage` to the node.
>> #[export]
>> pub fn add_service_usage_to_node(
>> #[try_from_ref] this: &Scheduler,
>> nodename: &str,
>> - service: StaticServiceUsage,
>> + sid: &str,
>> + service_usage: StaticServiceUsage,
>
> So this requires a versioned Breaks in d/control. Please mention this in
> the commit notes and in the cover letter.
ACK will add that information here and will keep that in mind for future
patch series!
>
>> ) -> Result<(), Error> {
>> let mut usage = this.inner.lock().unwrap();
>>
>> match usage.nodes.get_mut(nodename) {
>> Some(node) => {
>> - node.add_service_usage(&service);
>> - Ok(())
>> + if node.services.contains_key(sid) {
>> + bail!("service '{}' already added to node '{}'", sid, nodename);
>> + }
>> +
>> + node.services.insert(sid.to_string(), service_usage);
>> }
>> None => bail!("node '{}' not present in usage hashmap", nodename),
>> }
>> +
>> + if let Some(nodes) = usage.service_nodes.get_mut(sid) {
>> + nodes.push(nodename.to_string());
>> + } else {
>> + usage
>> + .service_nodes
>> + .insert(sid.to_string(), vec![nodename.to_string()]);
>> + }
>
> Nit: just to be sure, we could check whether the node would be duplicate
> in service_nodes too (is more straightforward with using HashSet). Maybe
> not really worth it though, because if we keep the mappings node ->
> services and service -> nodes consistent with each other (which we of
> course should) then the check above already protects us.
I'll check that again, but I'd go for the HashSet idea then as it fits
better conceptually. Also because I was curious:
```
println!("HashMap<String, Vec<String>>: {}", size_of::<HashMap<String, Vec<String>>>());
println!("HashMap<String, HashSet<String>>: {}", size_of::<HashMap<String, HashSet<String>>>());
---
HashMap<String, Vec<String>>: 48
HashMap<String, HashSet<String>>: 48
```
although
```
println!("Vec<String>: {}", size_of::<Vec<String>>());
println!("HashSet<String>: {}", size_of::<HashSet<String>>());
---
Vec<String>: 24
HashSet<String>: 48
```
>
> ---snip---
>
>> @@ -50,7 +65,54 @@ sub test_balance {
>> is($nodes[1], "A", 'second should be A');
>> }
>>
>> - $static->add_service_usage_to_node($nodes[0], $service);
>> + $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
>
> Nit: could also just use something like "vm:$i" or "vm:" . 100 + $i
> rather than letters. If it's a variable-controlled enumeration that
> seems more natural to me.
Right, I wanted to stick to the alphabet usage here but that makes much
more sense for something that is generated ^^.
>
>> + }
>> +}
>> +
>
> ---snip---
>
>> @@ -96,9 +158,9 @@ sub test_balance_small_memory_difference {
>> $static->add_node("C", 4, 8_000_000_000);
>>
>> if ($with_start_load) {
>> - $static->add_service_usage_to_node("A", { maxcpu => 4, maxmem => 1_000_000_000 });
>> - $static->add_service_usage_to_node("B", { maxcpu => 2, maxmem => 1_000_000_000 });
>> - $static->add_service_usage_to_node("C", { maxcpu => 2, maxmem => 1_000_000_000 });
>> + $static->add_service_usage_to_node("A", "a", { maxcpu => 4, maxmem => 1_000_000_000 });
>> + $static->add_service_usage_to_node("B", "b", { maxcpu => 2, maxmem => 1_000_000_000 });
>> + $static->add_service_usage_to_node("C", "c", { maxcpu => 2, maxmem => 1_000_000_000 });
>> }
>>
>> my $service = {
>> @@ -106,6 +168,7 @@ sub test_balance_small_memory_difference {
>> maxmem => 16_000_000,
>> };
>>
>> + my @sid_names = ("d" .. "z");
>> for (my $i = 0; $i < 20; $i++) {
>> my $score_list = $static->score_nodes_to_start_service($service);
>>
>> @@ -131,12 +194,13 @@ sub test_balance_small_memory_difference {
>> die "internal error, got $i % 4 == " . ($i % 4) . "\n";
>> }
>>
>> - $static->add_service_usage_to_node($nodes[0], $service);
>> + $static->add_service_usage_to_node($nodes[0], $sid_names[$i], $service);
>
> Nit: same as above regarding service names, we're already very close to
> the edge of the alphabet in this case ;P
>
>> }
>> }
>>
>> test_basic();
>> test_balance();
>> +test_balance_removal();
>> test_overcommitted();
>> test_balance_small_memory_difference(1);
>> test_balance_small_memory_difference(0);
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-10-16 15:15 ` Daniel Kral
@ 2025-10-17 10:02 ` Fiona Ebner
2025-10-17 10:08 ` Fiona Ebner
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 10:02 UTC (permalink / raw)
To: Daniel Kral, Proxmox VE development discussion
On 16.10.25 at 5:15 PM, Daniel Kral wrote:
> On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
>> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>>> };
>>> }
>>>
>>> +sub get_static_service_stats {
>>> + my ($self, $id) = @_;
>>> +
>>> + # undef if update_static_service_stats(...) failed before
>>> + return undef if !defined($self->{static_service_stats});
>>> +
>>> + return $self->{static_service_stats}->{$id} // {};
>>
>> Can't returning '{}' when nothing is there lead to issues down the line?
>> If we return undef instead, it's consistent with not having anything
>> cached and the caller will fall back to loading the config.
>
> This return value type definitely needs improvement and/or better
> documentation, but an undef $self->{static_service_stats}->{$id} value
> indicates that it should fallback to the default value as none of the
> properties requested by get_guest_config_properties(...) was included in
> that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
> properties defined in a VM config.
> When $self->{static_service_stats} itself is undef, then the static
> cache couldn't be queried for some reason.
Okay, so get_guest_config_properties() only includes guests that do have
one of the queried properties explicitly set in its result. Thus, we
cannot distinguish between the cache being created at a time when a
guest did not exist yet or a guest with none of the queried properties
explicitly set. If returning {} as a fallback, we get the wrong values
in the former case, if returning undef as a fallback, we just have to
explicitly load the config in the latter case. There probably are not
many setups with many guests without any of the queried properties
explicitly set, so that is unlikely to hurt performance in practice.
* Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-10-17 10:02 ` Fiona Ebner
@ 2025-10-17 10:08 ` Fiona Ebner
2025-10-17 16:18 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 10:08 UTC (permalink / raw)
To: Daniel Kral, Proxmox VE development discussion
On 17.10.25 at 12:02 PM, Fiona Ebner wrote:
> On 16.10.25 at 5:15 PM, Daniel Kral wrote:
>> On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
>>> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>>>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>>>> };
>>>> }
>>>>
>>>> +sub get_static_service_stats {
>>>> + my ($self, $id) = @_;
>>>> +
>>>> + # undef if update_static_service_stats(...) failed before
>>>> + return undef if !defined($self->{static_service_stats});
>>>> +
>>>> + return $self->{static_service_stats}->{$id} // {};
>>>
>>> Can't returning '{}' when nothing is there lead to issues down the line?
>>> If we return undef instead, it's consistent with not having anything
>>> cached and the caller will fall back to loading the config.
>>
>> This return value type definitely needs improvement and/or better
>> documentation, but an undef $self->{static_service_stats}->{$id} value
>> indicates that it should fallback to the default value as none of the
>> properties requested by get_guest_config_properties(...) was included in
>> that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
>> properties defined in a VM config.
>> When $self->{static_service_stats} itself is undef, then the static
>> cache couldn't be queried for some reason.
>
> Okay, so get_guest_config_properties() only includes guests that do have
> one of the queried properties explicitly set in its result. Thus, we
> cannot distinguish between the cache having been created at a time when
> a guest did not exist yet and a guest that has none of the queried
> properties explicitly set. If returning {} as a fallback, we get the
> wrong values in the former case; if returning undef as a fallback, we
> just have to explicitly load the config in the latter case. There probably are not
> many setups with many guests without any of the queried properties
> explicitly set, so that is unlikely to hurt performance in practice.
Or additionally, we could initialize the cache with {} for the guests
that exist at that moment, but do not have any of the queried properties
explicitly set.
* Re: [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes
2025-10-16 15:34 ` Daniel Kral
@ 2025-10-17 10:55 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 10:55 UTC (permalink / raw)
To: Daniel Kral, Proxmox VE development discussion
On 16.10.25 at 5:34 PM, Daniel Kral wrote:
> On Thu Oct 16, 2025 at 12:32 PM CEST, Fiona Ebner wrote:
>> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>>> Implements a simple bidirectional map to track which service usages have
>>> been added to nodes, so that these can be removed later individually.
>>>
>>> The StaticServiceUsage is added to the HashMap<> in StaticNodeInfo to
>>> reduce unnecessary indirection when summing these values in
>>> score_nodes_to_start_service(...).
>>>
>>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>>
>> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
>>
>>> ---
>>> I started out adding and removing the service_usage in StaticNodeUsage
>>> on each add_service_usage_to_node(...) and remove_service_usage(...)
>>> call individually, but I think summing those up every time is better for
>>> numerical stability..
>>
>> This can go into the commit message too if worded a bit less informally ;)
>
> Will do! Even though I asked myself if the concern about numerical
> stability really holds true here if we don't go for ha #9 as then these
> values will only change over a single manage(...) run and I'd guess that
> most users would use floating-point values which can be represented
> exactly (0.0, 1.0, 2.0, 3.0, ... and 0.5, 0.25, 0.125, ...). For these,
> numerical stability shouldn't be a problem at all, or not?
I guess cpulimit for containers is the obvious exception ;)
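To make this concrete, a small stand-alone illustration (not ha-manager code,
values picked for the example): exactly representable usages like 64 or 0.5
round-trip cleanly, but a fractional container cpulimit such as 0.1 can pick
up rounding error when other usages are added and removed around it, which
re-summing from scratch avoids:

#!/usr/bin/perl
use strict;
use warnings;

my $small = 0.1;    # e.g. a container with cpulimit 0.1 (not exactly representable)
my $big = 64;       # e.g. a VM with 64 vCPUs (exactly representable)

# incremental accounting: add both usages, then remove the big one again
my $incremental = 0;
$incremental += $small;
$incremental += $big;
$incremental -= $big;

# recomputed accounting: sum the remaining usages from scratch
my $resummed = 0;
$resummed += $_ for ($small);

printf "incremental: %.20g\n", $incremental;
printf "re-summed:   %.20g\n", $resummed;
# prints "no": the incremental value picked up rounding error from passing
# through 64.1; with $small = 0.5 it would print "yes"
printf "equal? %s\n", $incremental == $resummed ? "yes" : "no";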
* Re: [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class Daniel Kral
@ 2025-10-17 11:14 ` Fiona Ebner
2025-10-17 15:46 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 11:14 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:21 PM, Daniel Kral wrote:
> The resource affinity rules need information about the other HA
> resource's used nodes to be enacted correctly, which has been proxied
> through $online_node_usage before.
>
> The get_used_service_nodes(...) reflects the same logic to retrieve the
> nodes, where a HA resource $sid currently puts load on, as in
> recompute_online_node_usage(...).
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
with some comments:
> ---
> I wanted to put this information in PVE::HA::Usage directly, but figured
> it would make Usage much more dependent on $ss / $sd.. We can still do
> that later on or move the helper elsewhere, e.g. making ServiceStatus
> its own class?
I think having the get_used_service_nodes() helper in PVE::HA::Usage is
better than in PVE::HA::Tools, because to me, "tools" sounds like
things that should not be concerned with concrete HA state logic. You
could adapt the signature to take e.g.
($state, $online_nodes, $node, $target)
rather than $sd to avoid any need to know about its internal structure.
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 468e41eb..0226e427 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -121,12 +121,12 @@ sub flush_master_status {
>
> =head3 select_service_node(...)
>
> -=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference)
> +=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)
>
Pre-existing, but is this duplicate heading intentional?
> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
> index 71eb5d0b..7f718e25 100644
> --- a/src/PVE/HA/Tools.pm
> +++ b/src/PVE/HA/Tools.pm
> @@ -188,6 +188,35 @@ sub count_fenced_services {
> return $count;
> }
>
> +sub get_used_service_nodes {
> + my ($sd, $online_nodes) = @_;
> +
> + my $result = {};
> +
> + my ($state, $node, $target) = $sd->@{qw(state node target)};
> +
> + return $result if $state eq 'stopped' || $state eq 'request_start';
> +
> + if (
> + $state eq 'started'
> + || $state eq 'request_stop'
> + || $state eq 'fence'
> + || $state eq 'freeze'
> + || $state eq 'error'
> + || $state eq 'recovery'
> + || $state eq 'migrate'
> + || $state eq 'relocate'
> + ) {
> + $result->{current} = $node if $online_nodes->{$node};
> + }
> +
> + if ($state eq 'migrate' || $state eq 'relocate' || $state eq 'request_start_balance') {
> + $result->{target} = $target if defined($target) && $online_nodes->{$target};
> + }
> +
> + return $result;
Returning ($current, $target) seems to better match what callers need.
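Purely for illustration, with the suggested signature and return value the
helper could look roughly like this (a sketch, not the actual patch):

sub get_used_service_nodes {
    my ($state, $online_nodes, $node, $target) = @_;

    my ($current, $used_target);

    return ($current, $used_target) if $state eq 'stopped' || $state eq 'request_start';

    # these states put load on the node the resource is currently on
    if (
        grep { $state eq $_ }
        qw(started request_stop fence freeze error recovery migrate relocate)
    ) {
        $current = $node if $online_nodes->{$node};
    }

    # migrations and balanced starts additionally put load on the target node
    if (grep { $state eq $_ } qw(migrate relocate request_start_balance)) {
        $used_target = $target if defined($target) && $online_nodes->{$target};
    }

    return ($current, $used_target);
}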
* Re: [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper Daniel Kral
@ 2025-10-17 11:25 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 11:25 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> As the previously introduced PVE::HA::Tools::get_used_service_nodes(...)
> helper reflects the same logic as in recompute_online_node_usage(...),
> use the helper instead to reduce code duplication.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
* Re: [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations Daniel Kral
@ 2025-10-17 11:57 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 11:57 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> This makes use of the new signature for add_service_usage_to_node(...)
> from the Proxmox::RS::ResourceScheduling::Static package, which allows
> tracking of which HA resources have been assigned to which nodes.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
with some style nits below:
(also good to have a note: dependency bump for pve-rs needed)
> @@ -51,10 +51,18 @@ sub add_service_usage_to_node {
> }
> }
>
> +sub remove_service_usage {
> + my ($self, $sid) = @_;
> +
> + for my $node ($self->list_nodes()) {
> + delete $self->{nodes}->{$node}->{$sid};
> + }
> +}
> +
> sub score_nodes_to_start_service {
> my ($self, $sid, $service_node) = @_;
>
> - return $self->{nodes};
> + return { map { $_ => scalar keys $self->{nodes}->{$_}->%* } keys $self->{nodes}->%* };
Style nit: I feel like 'scalar' is more readable when surrounding its
argument with parentheses.
Style nit: Maybe add
my $nodes = $self->{nodes};
up front to defuse the lengthy expression a bit?
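In other words, something along these lines (just a sketch of the suggested
style, behavior unchanged):

sub score_nodes_to_start_service {
    my ($self, $sid, $service_node) = @_;

    my $nodes = $self->{nodes};

    # score each node by how many services are currently assigned to it
    return { map { $_ => scalar(keys $nodes->{$_}->%*) } keys $nodes->%* };
}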
> @@ -89,16 +89,27 @@ my sub get_service_usage {
> sub add_service_usage_to_node {
> my ($self, $nodename, $sid, $service_node, $migration_target) = @_;
>
> - $self->{'service-counts'}->{$nodename}++;
> + $self->{'node-services'}->{$nodename}->{$sid} = 1;
>
> eval {
> my $service_usage = get_service_usage($self, $sid, $service_node, $migration_target);
> - $self->{scheduler}->add_service_usage_to_node($nodename, $service_usage);
> + $self->{scheduler}->add_service_usage_to_node($nodename, $sid, $service_usage);
> };
> $self->{haenv}->log('warning', "unable to add service '$sid' usage to node '$nodename' - $@")
> if $@;
> }
>
> +sub remove_service_usage {
> + my ($self, $sid) = @_;
> +
> + delete $self->{'node-services'}->{$_}->{$sid} for $self->list_nodes();
Style nit: add parentheses for the delete(), since other code follows it
(and the argument is quite lengthy too):
https://pve.proxmox.com/wiki/Perl_Style_Guide#Perl_syntax_choices
> +
> + eval { $self->{scheduler}->remove_service_usage($sid) };
> + $self->{haenv}->log('warning', "unable to remove service '$sid' usage - $@") if $@;
> +
> + delete $self->{'service-stats'}->{$sid}; # Invalidate old service stats
> +}
> +
> sub score_nodes_to_start_service {
> my ($self, $sid, $service_node) = @_;
>
> @@ -111,7 +122,10 @@ sub score_nodes_to_start_service {
> 'err',
> "unable to score nodes according to static usage for service '$sid' - $err",
> );
> - return $self->{'service-counts'};
> + return {
> + map { $_ => scalar keys $self->{'node-services'}->{$_}->%* }
Style nit: same as above regarding parentheses and using a variable to
improve the readability
> + keys $self->{'node-services'}->%*
> + };
> }
>
> # Take minus the value, so that a lower score is better, which our caller(s) expect(s).
* Re: [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular Daniel Kral
@ 2025-10-17 12:32 ` Fiona Ebner
2025-10-17 16:07 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 12:32 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:20 PM, Daniel Kral wrote:
> The HA Manager builds $online_node_usage in every FSM iteration in
> manage(...) and at every HA resource state change in
> change_service_state(...). This becomes quite costly with a high HA
> resource count and a lot of state changes happening at once, e.g.
> starting up multiple nodes with rebalance_on_request_start set or a
> failover of a node with many configured HA resources.
>
> To improve this situation, make the changes to the $online_node_usage
> more granular by building $online_node_usage only once per call to
> manage(...) and changing the nodes a HA resource uses individually on
> every HA resource state transition.
>
> The change in service usage "freshness" should be negligible here as the
> static service usage data is cached anyway (except if the cache fails
> for some reason).
But the cache is refreshed on every recompute_online_node_usage(), which
happened much more frequently before, so the fact that it's cached
doesn't seem like a strong argument here?
I /do/ think there is a real tradeoff being made, namely "the ability to
manage much larger fleets of guests" versus "immediately incorporating
every guest config change in decisions". Config changes that would lead
to wildly different decisions would need to be timed very badly to cause
actual issues and should be rare to begin with. With PSI-based
information, things are also less "instant", so I don't see an issue
with moving in the same direction.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> The add_service_usage(...) helper is added in anticipation for the next
> patch, we don't need a helper if we don't go for #9.
I think it's nice to have regardless. Inlining the function would just
bloat change_service_state(), or what would the alternative be?
> @@ -314,7 +329,8 @@ my $change_service_state = sub {
> $sd->{$k} = $v;
> }
>
> - $self->recompute_online_node_usage();
> + $self->{online_node_usage}->remove_service_usage($sid);
> + $self->add_service_usage($sid, $sd);
Nice!
>
> $sd->{uid} = compute_new_uuid($new_state);
>
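A rough sketch of what such an add_service_usage(...) helper could look like,
building on the get_used_service_nodes(...) helper from patch 5/9
(illustrative only; the actual helper in the series may differ):

sub add_service_usage {
    my ($self, $sid, $sd) = @_;

    my $online_node_usage = $self->{online_node_usage};
    my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };

    # nodes the resource currently puts load on (current node and/or target)
    my $used = PVE::HA::Tools::get_used_service_nodes($sd, $online_nodes);

    for my $node (grep { defined($_) } $used->{current}, $used->{target}) {
        $online_node_usage->add_service_usage_to_node($node, $sid, $sd->{node}, $sd->{target});
    }
}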
* Re: [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
@ 2025-10-17 12:42 ` Fiona Ebner
2025-10-17 15:59 ` Daniel Kral
0 siblings, 1 reply; 36+ messages in thread
From: Fiona Ebner @ 2025-10-17 12:42 UTC (permalink / raw)
To: Proxmox VE development discussion, Daniel Kral
On 30.09.25 at 4:21 PM, Daniel Kral wrote:
> The $online_node_usage is built on every call to manage(...) now, but
> can be reduced to only be built on any scheduler mode change (including
> initialization or error path to be complete).
>
> This allows recompute_online_node_usage(...) to be reduced to
> adding/removing nodes whenever these become online or are not online
> anymore and handle the service usage updates whenever these change.
> Therefore, recompute_online_node_usage(...) must only be called once in
> manage(...) after $ns was properly updated.
>
> Note that this makes the ha-manager not acknowledge any hotplug changes
> to the guest configs anymore as long as the HA resource state doesn't
> change.
I'm not comfortable with that to be honest, because it would not just be
a very badly timed large change that can lead to unexpected decisions,
but an accumulation of smaller changes without any bad timing.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> If we go for this patch, then we would need some mechanism to update the
> static usage for a single or all HA resources registered in
> $online_node_usage at once (or just rebuilt $online_node_usage at that
> point..).
You mean triggered from qemu-server/pve-container upon update? In
combination with that it would be acceptable I think. Question is, do we
want to spend even more time optimizing the static scheduler, or just
apply a v2 without patch 9/9 and rather focus on getting a PSI-based
scheduler going?
* Re: [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class
2025-10-17 11:14 ` Fiona Ebner
@ 2025-10-17 15:46 ` Daniel Kral
2025-10-20 15:18 ` Fiona Ebner
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Kral @ 2025-10-17 15:46 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Fri Oct 17, 2025 at 1:14 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>> I wanted to put this information in PVE::HA::Usage directly, but figured
>> it would make Usage much more dependent on $ss / $sd.. We can still do
>> that later on or move the helper elsewhere, e.g. making ServiceStatus
>> its own class?
>
> I think having the get_used_service_nodes() helper in PVE::HA::Usage is
> better than in PVE::HA::Tools, because to me, "tools" sounds like
> things that should not be concerned with concrete HA state logic. You
> could adapt the signature to take e.g.
> ($state, $online_nodes, $node, $target)
> rather than $sd to avoid any need to know about its internal structure.
Sounds great, I'll do that!
>
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index 468e41eb..0226e427 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -121,12 +121,12 @@ sub flush_master_status {
>>
>> =head3 select_service_node(...)
>>
>> -=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference)
>> +=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)
>>
>
> Pre-existing, but is this duplicate heading intentional?
Yes, I picked it up from the Perl documentation RFC. It makes the
heading easier to reference in perlpod, since one can link to just the
short heading instead of having to write out the heading with all of
the parameters, while the latter is used by the LSP AFAICT (which I
don't actually use myself):
L<< selects the service's node|/select_service_node(...) >>
This particular reference could be removed, though, since it currently
isn't used anywhere.
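For reference, the pattern from the hunk above then looks roughly like this
(sketch only, the description text is made up):

=head3 select_service_node(...)

=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)

Selects the best node for the HA resource C<$sid>.

=cut

# elsewhere in the POD, a link can target the short heading instead of the
# heading with the full parameter list:
#   L<< selects the service's node|/select_service_node(...) >>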
>
>> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
>> index 71eb5d0b..7f718e25 100644
>> --- a/src/PVE/HA/Tools.pm
>> +++ b/src/PVE/HA/Tools.pm
>> @@ -188,6 +188,35 @@ sub count_fenced_services {
>> return $count;
>> }
>>
>> +sub get_used_service_nodes {
>> + my ($sd, $online_nodes) = @_;
>> +
>> + my $result = {};
>> +
>> + my ($state, $node, $target) = $sd->@{qw(state node target)};
>> +
>> + return $result if $state eq 'stopped' || $state eq 'request_start';
>> +
>> + if (
>> + $state eq 'started'
>> + || $state eq 'request_stop'
>> + || $state eq 'fence'
>> + || $state eq 'freeze'
>> + || $state eq 'error'
>> + || $state eq 'recovery'
>> + || $state eq 'migrate'
>> + || $state eq 'relocate'
>> + ) {
>> + $result->{current} = $node if $online_nodes->{$node};
>> + }
>> +
>> + if ($state eq 'migrate' || $state eq 'relocate' || $state eq 'request_start_balance') {
>> + $result->{target} = $target if defined($target) && $online_nodes->{$target};
>> + }
>> +
>> + return $result;
>
> Returning ($current, $target) seems to better match what callers need.
Right, will do!
* Re: [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular
2025-10-17 12:42 ` Fiona Ebner
@ 2025-10-17 15:59 ` Daniel Kral
0 siblings, 0 replies; 36+ messages in thread
From: Daniel Kral @ 2025-10-17 15:59 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Fri Oct 17, 2025 at 2:42 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>> The $online_node_usage is built on every call to manage(...) now, but
>> can be reduced to only be built on any scheduler mode change (including
>> initialization or error path to be complete).
>>
>> This allows recompute_online_node_usage(...) to be reduced to
>> adding/removing nodes whenever these become online or are not online
>> anymore and handle the service usage updates whenever these change.
>> Therefore, recompute_online_node_usage(...) must only be called once in
>> manage(...) after $ns was properly updated.
>>
>> Note that this makes the ha-manager not acknowledge any hotplug changes
>> to the guest configs anymore as long as the HA resource state doesn't
>> change.
>
> I'm not comfortable with that to be honest, because it would not just be
> a very badly timed large change that can lead to unexpected decisions,
> but an accumulation of smaller changes without any bad timing.
>
>>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>> If we go for this patch, then we would need some mechanism to update the
>> static usage for a single or all HA resources registered in
>> $online_node_usage at once (or just rebuilt $online_node_usage at that
>> point..).
>
> You mean triggered from qemu-server/pve-container upon update? In
> combination with that it would be acceptable I think. Question is, do we
> want to spend even more time optimizing the static scheduler, or just
> apply a v2 without patch 9/9 and rather focus on getting a PSI-based
> scheduler going?
Right, the patch was more of a leftover from an initial approach, but I
still wanted feedback on whether there's any benefit to doing it that
way. In hindsight it probably only adds unnecessary complexity, and
might even turn into overhead in the long run that could introduce weird
bugs. Especially since the performance is now very acceptable, I don't
see a reason to optimize further here until we have a better reason to.
* Re: [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular
2025-10-17 12:32 ` Fiona Ebner
@ 2025-10-17 16:07 ` Daniel Kral
0 siblings, 0 replies; 36+ messages in thread
From: Daniel Kral @ 2025-10-17 16:07 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Fri Oct 17, 2025 at 2:32 PM CEST, Fiona Ebner wrote:
> On 30.09.25 at 4:20 PM, Daniel Kral wrote:
>> The HA Manager builds $online_node_usage in every FSM iteration in
>> manage(...) and at every HA resource state change in
>> change_service_state(...). This becomes quite costly with a high HA
>> resource count and a lot of state changes happening at once, e.g.
>> starting up multiple nodes with rebalance_on_request_start set or a
>> failover of a node with many configured HA resources.
>>
>> To improve this situation, make the changes to the $online_node_usage
>> more granular by building $online_node_usage only once per call to
>> manage(...) and changing the nodes a HA resource uses individually on
>> every HA resource state transition.
>>
>> The change in service usage "freshness" should be negligible here as the
>> static service usage data is cached anyway (except if the cache fails
>> for some reason).
>
> But the cache is refreshed on every recompute_online_node_usage(), which
> happened much more frequently before, so the fact that it's cached
> doesn't seem like a strong argument here?
>
> I /do/ think there is a real tradeoff being made, namely "the ability to
> manage much larger fleets of guests" versus "immediately incorporating
> every guest config change in decisions". Config changes that would lead
> to wildly different decisions would need to be timed very badly to cause
> actual issues and should be rare to begin with. With PSI-based
> information, things are also less "instant", so I don't see an issue
> with moving in the same direction.
Right, I'll change that to better reflect the tradeoff here!
* Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
2025-10-17 10:08 ` Fiona Ebner
@ 2025-10-17 16:18 ` Daniel Kral
0 siblings, 0 replies; 36+ messages in thread
From: Daniel Kral @ 2025-10-17 16:18 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On Fri Oct 17, 2025 at 12:08 PM CEST, Fiona Ebner wrote:
> On 17.10.25 at 12:02 PM, Fiona Ebner wrote:
>> On 16.10.25 at 5:15 PM, Daniel Kral wrote:
>>> On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
>>>> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>>>>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>>>>> };
>>>>> }
>>>>>
>>>>> +sub get_static_service_stats {
>>>>> + my ($self, $id) = @_;
>>>>> +
>>>>> + # undef if update_static_service_stats(...) failed before
>>>>> + return undef if !defined($self->{static_service_stats});
>>>>> +
>>>>> + return $self->{static_service_stats}->{$id} // {};
>>>>
>>>> Can't returning '{}' when nothing is there lead to issues down the line?
>>>> If we return undef instead, it's consistent with not having anything
>>>> cached and the caller will fall back to loading the config.
>>>
>>> This return value type definitely needs improvement and/or better
>>> documentation, but an undef $self->{static_service_stats}->{$id} value
>>> indicates that it should fallback to the default value as none of the
>>> properties requested by get_guest_config_properties(...) was included in
>>> that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
>>> properties defined in a VM config.
>>> When $self->{static_service_stats} itself is undef, then the static
>>> cache couldn't be queried for some reason.
>>
>> Okay, so get_guest_config_properties() only includes guests that do have
>> one of the queried properties explicitly set in its result. Thus, we
Yes, I guess that was because of the initial use case: quickly getting
all guests which have 'tags' or 'lock' set in their config files, where
no information about the guests that don't have these is needed.
>> cannot distinguish between the cache having been created at a time when
>> a guest did not exist yet and a guest that has none of the queried
>> properties explicitly set. If returning {} as a fallback, we get the
>> wrong values in the former case; if returning undef as a fallback, we
>> just have to explicitly load the config in the latter case. There probably are not
>> many setups with many guests without any of the queried properties
>> explicitly set, so that is unlikely to hurt performance in practice.
>
> Or additionally, we could initialize the cache with {} for the guests
> that exist at that moment, but do not have any of the queried properties
> explicitly set.
Right, the cache interface should definitely be decoupled from how it is
populated, especially if it might be filled by something other than
get_guest_config_properties(...).
I think I'll go for the idea of a wrapper over
get_guest_config_properties(...), which can quickly set {} explicitly
for all the guests (queried from PVE::Cluster's vmlist) that have none
of the queried properties.
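A rough sketch of what that wrapper could look like (hypothetical code - the
property list, error handling and helper names are assumptions, not the
actual implementation):

sub update_static_service_stats {
    my ($self) = @_;

    my $stats = eval {
        PVE::Cluster::get_guest_config_properties(['cores', 'sockets', 'memory']);
    };
    if ($@) {
        # signal "cache unavailable", so callers fall back to the guest config
        $self->{static_service_stats} = undef;
        return;
    }

    # guests that exist but have none of the queried properties set still get
    # an (empty) entry, so they can be told apart from unknown guests
    my $vmlist = PVE::Cluster::get_vmlist();
    $stats->{$_} //= {} for keys %{ $vmlist->{ids} // {} };

    $self->{static_service_stats} = $stats;
}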
* Re: [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class
2025-10-17 15:46 ` Daniel Kral
@ 2025-10-20 15:18 ` Fiona Ebner
0 siblings, 0 replies; 36+ messages in thread
From: Fiona Ebner @ 2025-10-20 15:18 UTC (permalink / raw)
To: Daniel Kral, Proxmox VE development discussion
On 17.10.25 at 5:46 PM, Daniel Kral wrote:
> On Fri Oct 17, 2025 at 1:14 PM CEST, Fiona Ebner wrote:
>> On 30.09.25 at 4:21 PM, Daniel Kral wrote:
>>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>>> index 468e41eb..0226e427 100644
>>> --- a/src/PVE/HA/Manager.pm
>>> +++ b/src/PVE/HA/Manager.pm
>>> @@ -121,12 +121,12 @@ sub flush_master_status {
>>>
>>> =head3 select_service_node(...)
>>>
>>> -=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $sd, $node_preference)
>>> +=head3 select_service_node($rules, $online_node_usage, $sid, $service_conf, $ss, $node_preference)
>>>
>>
>> Pre-existing, but is this duplicate heading intentional?
>
> Yes, I picked it up from the Perl documentation RFC. This makes it
> easier to reference it in perlpod when one can use only the short
> heading instead of having to write out the heading with all of the
> parameters, while the latter will be used for the LSP AFAICT (which I
> actually don't use myself..).
>
> L<< selects the service's node|/select_service_node(...) >>>
>
> This one could be removed because currently it isn't used anywhere.
Oh, good to know! I was just wondering; I'm fine either way.
* [pve-devel] superseded: [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
` (11 preceding siblings ...)
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
@ 2025-10-20 16:50 ` Daniel Kral
12 siblings, 0 replies; 36+ messages in thread
From: Daniel Kral @ 2025-10-20 16:50 UTC (permalink / raw)
To: Proxmox VE development discussion
On Tue Sep 30, 2025 at 4:19 PM CEST, Daniel Kral wrote:
> An initial RFC for making the changes to the HA $online_node_usage
> object more granular.
Superseded-by: https://lore.proxmox.com/pve-devel/20251020164540.517231-1-d.kral@proxmox.com/
end of thread
Thread overview: 36+ messages
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
2025-10-15 14:31 ` Fiona Ebner
2025-10-16 9:07 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes Daniel Kral
2025-10-16 10:32 ` Fiona Ebner
2025-10-16 15:34 ` Daniel Kral
2025-10-17 10:55 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache Daniel Kral
2025-10-16 11:12 ` Fiona Ebner
2025-10-16 15:15 ` Daniel Kral
2025-10-17 10:02 ` Fiona Ebner
2025-10-17 10:08 ` Fiona Ebner
2025-10-17 16:18 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery Daniel Kral
2025-10-16 11:25 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node " Daniel Kral
2025-10-16 11:33 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started Daniel Kral
2025-10-16 11:39 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class Daniel Kral
2025-10-17 11:14 ` Fiona Ebner
2025-10-17 15:46 ` Daniel Kral
2025-10-20 15:18 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper Daniel Kral
2025-10-17 11:25 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations Daniel Kral
2025-10-17 11:57 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular Daniel Kral
2025-10-17 12:32 ` Fiona Ebner
2025-10-17 16:07 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
2025-10-17 12:42 ` Fiona Ebner
2025-10-17 15:59 ` Daniel Kral
2025-10-20 16:50 ` [pve-devel] superseded: [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral