From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity
Date: Wed, 22 Apr 2026 12:00:25 +0200
Message-ID: <20260422100035.232716-8-d.kral@proxmox.com>
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
References: <20260422100035.232716-1-d.kral@proxmox.com>
List-Id: Proxmox VE development discussion

select_service_node() only considers the nodes from the highest priority class in a node affinity rule, which has at least one available
node. If an HA resource does not have any node affinity rule, the highest priority class is the set of all online nodes. get_node_affinity() already removes nodes that are considered not online, i.e., nodes that are currently offline or in maintenance mode.

Negative resource affinity rules introduce a new reason why a node can become unavailable to a specific HA resource: another HA resource, which that HA resource must not share a node with, is already running there.

Therefore, try the priority classes in order from highest to lowest priority until one of them results in a non-empty node set, or, if there are no priority classes left, end up with an empty node set. This reduces the number of cases where select_service_node() selects no node at all and the HA Manager therefore makes no change to the HA resources' node placement, even though a change is warranted.

The same change is made when generating migration candidates for the load balancer, which might allow finding better balancing migrations in certain highly constrained scenarios.

As seen in "test-resource-affinity-with-node-affinity-strict-negative3", this can also lead the HA Manager to make more abrupt decisions in certain highly constrained scenarios, though the end state is still valid within the semantics of non-strict node affinity rules. Nonetheless, handling negative affinity rules in these scenarios should be improved in the future.
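The fallback loop described above can be sketched language-neutrally as follows (Python for illustration only; the actual implementation is the Perl do/while loop in this patch, and the function and variable names below are hypothetical):

```python
def pick_target_nodes(priority_classes, separate_nodes):
    """Illustrative sketch: walk the priority classes from highest to
    lowest, removing nodes that already run a resource from a negative
    resource affinity rule, until a non-empty class remains or no
    classes are left (then the result is the empty set)."""
    # take the highest priority class, or an empty set if there is none
    target_nodes = set(priority_classes.pop(0)) if priority_classes else set()
    while True:
        # drop nodes excluded by negative resource affinity
        # (corresponds to apply_negative_resource_affinity)
        target_nodes -= separate_nodes
        if target_nodes or not priority_classes:
            return target_nodes
        # current class emptied out: fall back to the next priority class
        target_nodes = set(priority_classes.pop(0))
```

For example, if the highest priority class {node2, node3} is fully occupied by negatively-affine resources, the next class {node1} is used instead of selecting no node at all; note that, as in the Perl do/while, the removal is applied before the emptiness check, so a lower class is also filtered before being returned.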
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/Manager.pm | 21 ++++++++++++++--
 .../README            |  5 ++--
 .../log.expect        | 25 ++++++++++++++-----
 .../README            |  7 +++---
 .../log.expect        | 16 ++++++------
 5 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0d7a2f59..5e0439f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -199,7 +199,16 @@ sub get_resource_migration_candidates {
     my $target_nodes = shift @$priority_classes // {};

     my ($together, $separate) = get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
-    apply_negative_resource_affinity($separate, $target_nodes);
+
+    # do not consider nodes where HA resources from a possible negative resource
+    # affinity rule are running on.
+    # as such a negative resource affinity could end up emptying the current
+    # priority class, try the succeeding priority classes which result in a
+    # non-empty node set or else end up with an empty set.
+    do {
+        apply_negative_resource_affinity($separate, $target_nodes);
+    } while (keys %$target_nodes < 1 && ($target_nodes = shift @$priority_classes));
+    $target_nodes = {} if !defined($target_nodes);

     delete $target_nodes->{$current_leader_node};
@@ -354,7 +363,15 @@ sub select_service_node {
         }
     }

-    apply_negative_resource_affinity($separate, $pri_nodes);
+    # do not consider nodes where HA resources from a possible negative resource
+    # affinity rule are running on.
+    # as such a negative resource affinity could end up emptying the current
+    # priority class, try the succeeding priority classes which result in a
+    # non-empty node set or else end up with an empty set.
+    do {
+        apply_negative_resource_affinity($separate, $pri_nodes);
+    } while (keys %$pri_nodes < 1 && ($pri_nodes = shift @$priority_classes));
+    $pri_nodes = {} if !defined($pri_nodes);

     # fallback to the previous maintenance node if it is available again.
 #
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
index c6a11cec..e1fc0d04 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
@@ -4,5 +4,6 @@
 - in a non-strict node affinity rule to node2 and node3 (equal priority), and
 - in a strict negative resource affinity rule with each other.

-Tests whether the HA resource on node3 will stay there, even though node3 is
-put in maintenance mode, because it cannot find any replacement node.
+Tests whether the HA resource on node3 will correctly move to a replacement
+node, which is different from the node of the other HA resource (node2), and
+moves back to its previous maintenance node as soon as it's available again.
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
index 8899f782..1fc25206 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
@@ -30,12 +30,25 @@ info 25 node3/lrm: service status vm:102 started
 info 120 cmdlist: execute crm node3 enable-node-maintenance
 info 125 node3/lrm: status change active => maintenance
 info 140 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
-warn 140 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 160 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 180 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 200 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 140 node1/crm: migrate service 'vm:102' to node 'node1' (running)
+info 140 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node3, target = node1)
+info 141 node1/lrm: got lock 'ha_agent_node1_lock'
+info 141 node1/lrm: status change wait_for_agent_lock => active
+info 145 node3/lrm: service vm:102 - start migrate to node 'node1'
+info 145 node3/lrm: service vm:102 - end migrate to node 'node1'
+info 160 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node1)
+info 161 node1/lrm: starting service vm:102
+info 161 node1/lrm: service status vm:102 started
 info 220 cmdlist: execute crm node3 disable-node-maintenance
-warn 220 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 225 node3/lrm: got lock 'ha_agent_node3_lock'
+info 225 node3/lrm: status change maintenance => active
 info 240 node1/crm: node 'node3': state changed from 'maintenance' => 'online'
-info 240 node1/crm: service 'vm:102': clearing stale maintenance node 'node3' setting (is current node)
+info 240 node1/crm: moving service 'vm:102' back to 'node3', node came back from maintenance.
+info 240 node1/crm: migrate service 'vm:102' to node 'node3' (running)
+info 240 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
+info 241 node1/lrm: service vm:102 - start migrate to node 'node3'
+info 241 node1/lrm: service vm:102 - end migrate to node 'node3'
+info 260 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node3)
+info 265 node3/lrm: starting service vm:102
+info 265 node3/lrm: service status vm:102 started
 info 820 hardware: exit simulation - done
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
index 062fc665..581b0e9a 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
@@ -1,7 +1,8 @@
 Test whether a strict negative resource affinity rule among three resources,
-where two resources are restricted each to nodes they are not yet on, can be
-exchanged to the nodes described by their node affinity rules, if one of the
-resources is stopped.
+where all resources are restricted each to nodes they are not yet on, can be
+exchanged to the nodes described by their node affinity rules or fallback to
+another valid configuration within the semantics of non-strict node affinity
+rules, if one of the resources is stopped.

 The test scenario is:
 - vm:101, vm:102, and vm:103 should be on node2, node3 or node1 respectively
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
index 1ed34c36..66974583 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
@@ -57,11 +57,13 @@ info 285 node3/lrm: service status vm:102 started
 info 320 cmdlist: execute service vm:101 started
 info 320 node1/crm: service 'vm:101': state changed from 'stopped' to 'request_start' (node = node1)
 info 320 node1/crm: service 'vm:101': state changed from 'request_start' to 'started' (node = node1)
-info 320 node1/crm: migrate service 'vm:101' to node 'node2' (running)
-info 320 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
-info 321 node1/lrm: service vm:101 - start migrate to node 'node2'
-info 321 node1/lrm: service vm:101 - end migrate to node 'node2'
-info 340 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
-info 343 node2/lrm: starting service vm:101
-info 343 node2/lrm: service status vm:101 started
+info 320 node1/crm: migrate service 'vm:103' to node 'node2' (running)
+info 320 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
+info 321 node1/lrm: starting service vm:101
+info 321 node1/lrm: service status vm:101 started
+info 321 node1/lrm: service vm:103 - start migrate to node 'node2'
+info 321 node1/lrm: service vm:103 - end migrate to node 'node2'
+info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node2)
+info 343 node2/lrm: starting service vm:103
+info 343 node2/lrm: service status vm:103 started
 info 920 hardware: exit simulation - done
-- 
2.47.3