From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity
Date: Wed, 22 Apr 2026 12:00:25 +0200 [thread overview]
Message-ID: <20260422100035.232716-8-d.kral@proxmox.com> (raw)
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
select_service_node() only considers the nodes from the highest priority
class in a node affinity rule, which has at least one available node.
If an HA resource does not have any node affinity rule, the highest
priority class is the set of all online nodes.
get_node_affinity() already removes nodes, which are considered as not
online, i.e., currently offline nodes or nodes in maintenance mode.
Negative resource affinity rules introduced a new reason why nodes
become unavailable to specific HA resources: another HA resource is
already running on the node, which this specific HA resource must not
share the node with.
Therefore, try the succeeding priority classes from highest to lowest
priority until one of them results in a non-empty node set or if there
are no priority classes left, an empty node set.
This reduces the amount of cases, where select_service_node() reduces no
node at all and therefore the HA Manager making no change to the HA
resources' node placement, even though it is warranted.
This change is also done when generating migration candidates for the
load balancer, which might allow to find better balancing migrations in
certain highly constrained scenarios.
As seen in "test-resource-affinity-with-node-affinity-strict-negative3",
this can also lead the HA Manager to make more abrupt decisions in
certain highly constrained scenarios, though the end state is still
valid with the semantics of non-strict node affinity rules. Nonetheless,
handling negative affinity rules in these scenarios should be improved
in the future.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 21 ++++++++++++++--
.../README | 5 ++--
.../log.expect | 25 ++++++++++++++-----
.../README | 7 +++---
.../log.expect | 16 ++++++------
5 files changed, 54 insertions(+), 20 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0d7a2f59..5e0439f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -199,7 +199,16 @@ sub get_resource_migration_candidates {
my $target_nodes = shift @$priority_classes // {};
my ($together, $separate) =
get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
- apply_negative_resource_affinity($separate, $target_nodes);
+
+ # do not consider nodes where HA resources from a possible negative resource
+ # affinity rule are running on.
+ # as such a negative resource affinity could end up emptying the current
+ # priority class, try the succeeding priority classes which result in a
+ # non-empty node set or else end up with an empty set.
+ do {
+ apply_negative_resource_affinity($separate, $target_nodes);
+ } while (keys %$target_nodes < 1 && ($target_nodes = shift @$priority_classes));
+ $target_nodes = {} if !defined($target_nodes);
delete $target_nodes->{$current_leader_node};
@@ -354,7 +363,15 @@ sub select_service_node {
}
}
- apply_negative_resource_affinity($separate, $pri_nodes);
+ # do not consider nodes where HA resources from a possible negative resource
+ # affinity rule are running on.
+ # as such a negative resource affinity could end up emptying the current
+ # priority class, try the succeeding priority classes which result in a
+ # non-empty node set or else end up with an empty set.
+ do {
+ apply_negative_resource_affinity($separate, $pri_nodes);
+ } while (keys %$pri_nodes < 1 && ($pri_nodes = shift @$priority_classes));
+ $pri_nodes = {} if !defined($pri_nodes);
# fallback to the previous maintenance node if it is available again.
#
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
index c6a11cec..e1fc0d04 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
@@ -4,5 +4,6 @@
- in a non-strict node affinity rule to node2 and node3 (equal priority), and
- in a strict negative resource affinity rule with each other.
-Tests whether the HA resource on node3 will stay there, even though node3 is
-put in maintenance mode, because it cannot find any replacement node.
+Tests whether the HA resource on node3 will correctly move to a replacement
+node, which is different from the node of the other HA resource (node2), and
+moves back to its previous maintenance node as soon as it's available again.
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
index 8899f782..1fc25206 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
@@ -30,12 +30,25 @@ info 25 node3/lrm: service status vm:102 started
info 120 cmdlist: execute crm node3 enable-node-maintenance
info 125 node3/lrm: status change active => maintenance
info 140 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
-warn 140 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 160 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 180 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 200 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 140 node1/crm: migrate service 'vm:102' to node 'node1' (running)
+info 140 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node3, target = node1)
+info 141 node1/lrm: got lock 'ha_agent_node1_lock'
+info 141 node1/lrm: status change wait_for_agent_lock => active
+info 145 node3/lrm: service vm:102 - start migrate to node 'node1'
+info 145 node3/lrm: service vm:102 - end migrate to node 'node1'
+info 160 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node1)
+info 161 node1/lrm: starting service vm:102
+info 161 node1/lrm: service status vm:102 started
info 220 cmdlist: execute crm node3 disable-node-maintenance
-warn 220 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 225 node3/lrm: got lock 'ha_agent_node3_lock'
+info 225 node3/lrm: status change maintenance => active
info 240 node1/crm: node 'node3': state changed from 'maintenance' => 'online'
-info 240 node1/crm: service 'vm:102': clearing stale maintenance node 'node3' setting (is current node)
+info 240 node1/crm: moving service 'vm:102' back to 'node3', node came back from maintenance.
+info 240 node1/crm: migrate service 'vm:102' to node 'node3' (running)
+info 240 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
+info 241 node1/lrm: service vm:102 - start migrate to node 'node3'
+info 241 node1/lrm: service vm:102 - end migrate to node 'node3'
+info 260 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node3)
+info 265 node3/lrm: starting service vm:102
+info 265 node3/lrm: service status vm:102 started
info 820 hardware: exit simulation - done
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
index 062fc665..581b0e9a 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
@@ -1,7 +1,8 @@
Test whether a strict negative resource affinity rule among three resources,
-where two resources are restricted each to nodes they are not yet on, can be
-exchanged to the nodes described by their node affinity rules, if one of the
-resources is stopped.
+where all resources are restricted each to nodes they are not yet on, can be
+exchanged to the nodes described by their node affinity rules or fallback to
+another valid configuration within the semantics of non-strict node affinity
+rules, if one of the resources is stopped.
The test scenario is:
- vm:101, vm:102, and vm:103 should be on node2, node3 or node1 respectively
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
index 1ed34c36..66974583 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
@@ -57,11 +57,13 @@ info 285 node3/lrm: service status vm:102 started
info 320 cmdlist: execute service vm:101 started
info 320 node1/crm: service 'vm:101': state changed from 'stopped' to 'request_start' (node = node1)
info 320 node1/crm: service 'vm:101': state changed from 'request_start' to 'started' (node = node1)
-info 320 node1/crm: migrate service 'vm:101' to node 'node2' (running)
-info 320 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
-info 321 node1/lrm: service vm:101 - start migrate to node 'node2'
-info 321 node1/lrm: service vm:101 - end migrate to node 'node2'
-info 340 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
-info 343 node2/lrm: starting service vm:101
-info 343 node2/lrm: service status vm:101 started
+info 320 node1/crm: migrate service 'vm:103' to node 'node2' (running)
+info 320 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
+info 321 node1/lrm: starting service vm:101
+info 321 node1/lrm: service status vm:101 started
+info 321 node1/lrm: service vm:103 - start migrate to node 'node2'
+info 321 node1/lrm: service vm:103 - end migrate to node 'node2'
+info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node2)
+info 343 node2/lrm: starting service vm:103
+info 343 node2/lrm: service status vm:103 started
info 920 hardware: exit simulation - done
--
2.47.3
prev parent reply other threads:[~2026-04-22 10:01 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-22 10:00 [PATCH-SERIES ha-manager 0/7] improve handling of maintenance nodes Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 1/7] manager: warn if HA resources cannot be moved away from maintenance node Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 2/7] test: add test casses for node affinity rules with maintenance mode Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 3/7] test: add test cases for resource " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 4/7] manager: make HA resources without failback move back to maintenance node Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 5/7] manager: make HA resource bundles " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 6/7] make get_node_affinity return all priority classes sorted in descending order Daniel Kral
2026-04-22 10:00 ` Daniel Kral [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260422100035.232716-8-d.kral@proxmox.com \
--to=d.kral@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox