From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity
Date: Wed, 22 Apr 2026 12:00:25 +0200
Message-ID: <20260422100035.232716-8-d.kral@proxmox.com>
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>

select_service_node() only considers the nodes from the highest priority
class of a node affinity rule that has at least one available node. If
an HA resource does not have any node affinity rule, the highest
priority class is the set of all online nodes.

get_node_affinity() already removes nodes that are not considered
online, i.e., nodes that are currently offline or in maintenance mode.

Negative resource affinity rules introduce a new reason why nodes become
unavailable to specific HA resources: another HA resource, which the HA
resource in question must not share a node with, is already running on
the node.

Therefore, try the succeeding priority classes from highest to lowest
priority until one of them results in a non-empty node set, or settle
for an empty node set if no priority classes are left.

This reduces the number of cases where select_service_node() selects no
node at all and the HA Manager therefore makes no change to the HA
resources' node placement, even though a change is warranted.

This change is also applied when generating migration candidates for
the load balancer, which might allow finding better balancing
migrations in certain highly constrained scenarios.

As seen in "test-resource-affinity-with-node-affinity-strict-negative3",
this can also lead the HA Manager to make more abrupt decisions in
certain highly constrained scenarios, though the end state is still
valid under the semantics of non-strict node affinity rules.
Nonetheless, the handling of negative affinity rules in these scenarios
should be improved in the future.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/Manager.pm                         | 21 ++++++++++++++--
 .../README                                    |  5 ++--
 .../log.expect                                | 25 ++++++++++++++-----
 .../README                                    |  7 +++---
 .../log.expect                                | 16 ++++++------
 5 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0d7a2f59..5e0439f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -199,7 +199,16 @@ sub get_resource_migration_candidates {
         my $target_nodes = shift @$priority_classes // {};
         my ($together, $separate) =
             get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
-        apply_negative_resource_affinity($separate, $target_nodes);
+
+        # Do not consider nodes on which HA resources from a possible negative
+        # resource affinity rule are already running.
+        # As such a rule could empty the current priority class, try the
+        # succeeding priority classes until one results in a non-empty node
+        # set, or end up with an empty set if no classes are left.
+        do {
+            apply_negative_resource_affinity($separate, $target_nodes);
+        } while (keys %$target_nodes < 1 && ($target_nodes = shift @$priority_classes));
+        $target_nodes = {} if !defined($target_nodes);
 
         delete $target_nodes->{$current_leader_node};
 
@@ -354,7 +363,15 @@ sub select_service_node {
         }
     }
 
-    apply_negative_resource_affinity($separate, $pri_nodes);
+    # Do not consider nodes on which HA resources from a possible negative
+    # resource affinity rule are already running.
+    # As such a rule could empty the current priority class, try the
+    # succeeding priority classes until one results in a non-empty node
+    # set, or end up with an empty set if no classes are left.
+    do {
+        apply_negative_resource_affinity($separate, $pri_nodes);
+    } while (keys %$pri_nodes < 1 && ($pri_nodes = shift @$priority_classes));
+    $pri_nodes = {} if !defined($pri_nodes);
 
     # fallback to the previous maintenance node if it is available again.
     #
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
index c6a11cec..e1fc0d04 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
@@ -4,5 +4,6 @@
 - in a non-strict node affinity rule to node2 and node3 (equal priority), and
 - in a strict negative resource affinity rule with each other.
 
-Tests whether the HA resource on node3 will stay there, even though node3 is
-put in maintenance mode, because it cannot find any replacement node.
+Tests whether the HA resource on node3 will correctly move to a replacement
+node, which is different from the node of the other HA resource (node2), and
+move back to its previous maintenance node as soon as it is available again.
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
index 8899f782..1fc25206 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
@@ -30,12 +30,25 @@ info     25    node3/lrm: service status vm:102 started
 info    120      cmdlist: execute crm node3 enable-node-maintenance
 info    125    node3/lrm: status change active => maintenance
 info    140    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
-warn    140    node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn    160    node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn    180    node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn    200    node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info    140    node1/crm: migrate service 'vm:102' to node 'node1' (running)
+info    140    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    141    node1/lrm: got lock 'ha_agent_node1_lock'
+info    141    node1/lrm: status change wait_for_agent_lock => active
+info    145    node3/lrm: service vm:102 - start migrate to node 'node1'
+info    145    node3/lrm: service vm:102 - end migrate to node 'node1'
+info    160    node1/crm: service 'vm:102': state changed from 'migrate' to 'started'  (node = node1)
+info    161    node1/lrm: starting service vm:102
+info    161    node1/lrm: service status vm:102 started
 info    220      cmdlist: execute crm node3 disable-node-maintenance
-warn    220    node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info    225    node3/lrm: got lock 'ha_agent_node3_lock'
+info    225    node3/lrm: status change maintenance => active
 info    240    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
-info    240    node1/crm: service 'vm:102': clearing stale maintenance node 'node3' setting (is current node)
+info    240    node1/crm: moving service 'vm:102' back to 'node3', node came back from maintenance.
+info    240    node1/crm: migrate service 'vm:102' to node 'node3' (running)
+info    240    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node3)
+info    241    node1/lrm: service vm:102 - start migrate to node 'node3'
+info    241    node1/lrm: service vm:102 - end migrate to node 'node3'
+info    260    node1/crm: service 'vm:102': state changed from 'migrate' to 'started'  (node = node3)
+info    265    node3/lrm: starting service vm:102
+info    265    node3/lrm: service status vm:102 started
 info    820     hardware: exit simulation - done
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
index 062fc665..581b0e9a 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
@@ -1,7 +1,8 @@
 Test whether a strict negative resource affinity rule among three resources,
-where two resources are restricted each to nodes they are not yet on, can be
-exchanged to the nodes described by their node affinity rules, if one of the
-resources is stopped.
+where all resources are each restricted to nodes they are not yet on, can be
+exchanged to the nodes described by their node affinity rules or fall back to
+another valid configuration within the semantics of non-strict node affinity
+rules, if one of the resources is stopped.
 
 The test scenario is:
 - vm:101, vm:102, and vm:103 should be on node2, node3 or node1 respectively
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
index 1ed34c36..66974583 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
@@ -57,11 +57,13 @@ info    285    node3/lrm: service status vm:102 started
 info    320      cmdlist: execute service vm:101 started
 info    320    node1/crm: service 'vm:101': state changed from 'stopped' to 'request_start'  (node = node1)
 info    320    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
-info    320    node1/crm: migrate service 'vm:101' to node 'node2' (running)
-info    320    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node2)
-info    321    node1/lrm: service vm:101 - start migrate to node 'node2'
-info    321    node1/lrm: service vm:101 - end migrate to node 'node2'
-info    340    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node2)
-info    343    node2/lrm: starting service vm:101
-info    343    node2/lrm: service status vm:101 started
+info    320    node1/crm: migrate service 'vm:103' to node 'node2' (running)
+info    320    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node1, target = node2)
+info    321    node1/lrm: starting service vm:101
+info    321    node1/lrm: service status vm:101 started
+info    321    node1/lrm: service vm:103 - start migrate to node 'node2'
+info    321    node1/lrm: service vm:103 - end migrate to node 'node2'
+info    340    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node2)
+info    343    node2/lrm: starting service vm:103
+info    343    node2/lrm: service status vm:103 started
 info    920     hardware: exit simulation - done
-- 
2.47.3




