From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity
Date: Wed, 22 Apr 2026 12:00:25 +0200
Message-ID: <20260422100035.232716-8-d.kral@proxmox.com>
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
References: <20260422100035.232716-1-d.kral@proxmox.com>
List-Id: Proxmox VE development discussion

select_service_node() only considers the nodes from the highest priority class in a node affinity rule, which has at least one available
node. If an HA resource does not have any node affinity rule, the highest priority class is the set of all online nodes. get_node_affinity() already removes nodes that are considered not online, i.e., nodes that are currently offline or in maintenance mode.

Negative resource affinity rules introduce a new reason why a node can become unavailable to a specific HA resource: another HA resource, which that HA resource must not share a node with, is already running there.

Therefore, try the priority classes in order from highest to lowest priority until one of them results in a non-empty node set, or, if there are no priority classes left, end up with an empty node set. This reduces the number of cases where select_service_node() selects no node at all and the HA Manager therefore makes no change to the HA resources' node placement, even though a change is warranted.

The same change is made when generating migration candidates for the load balancer, which might allow finding better balancing migrations in certain highly constrained scenarios.

As seen in "test-resource-affinity-with-node-affinity-strict-negative3", this can also lead the HA Manager to make more abrupt decisions in certain highly constrained scenarios, though the end state is still valid within the semantics of non-strict node affinity rules. Nonetheless, handling negative affinity rules in these scenarios should be improved in the future.
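The fallback loop described above can be sketched language-neutrally as follows (Python for illustration only; the actual implementation is the Perl do/while loop in this patch, and the function and variable names below are hypothetical):

```python
def pick_target_nodes(priority_classes, separate_nodes):
    """Illustrative sketch: walk the priority classes from highest to
    lowest, removing nodes that already run a resource from a negative
    resource affinity rule, until a non-empty class remains or no
    classes are left (then the result is the empty set)."""
    # take the highest priority class, or an empty set if there is none
    target_nodes = set(priority_classes.pop(0)) if priority_classes else set()
    while True:
        # drop nodes excluded by negative resource affinity
        # (corresponds to apply_negative_resource_affinity)
        target_nodes -= separate_nodes
        if target_nodes or not priority_classes:
            return target_nodes
        # current class emptied out: fall back to the next priority class
        target_nodes = set(priority_classes.pop(0))
```

For example, if the highest priority class {node2, node3} is fully occupied by negatively-affine resources, the next class {node1} is used instead of selecting no node at all; note that, as in the Perl do/while, the removal is applied before the emptiness check, so a lower class is also filtered before being returned.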
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/Manager.pm | 21 ++++++++++++++--
 .../README            |  5 ++--
 .../log.expect        | 25 ++++++++++++++-----
 .../README            |  7 +++---
 .../log.expect        | 16 ++++++------
 5 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 0d7a2f59..5e0439f3 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -199,7 +199,16 @@ sub get_resource_migration_candidates {
     my $target_nodes = shift @$priority_classes // {};

     my ($together, $separate) = get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
-    apply_negative_resource_affinity($separate, $target_nodes);
+
+    # do not consider nodes where HA resources from a possible negative resource
+    # affinity rule are running on.
+    # as such a negative resource affinity could end up emptying the current
+    # priority class, try the succeeding priority classes which result in a
+    # non-empty node set or else end up with an empty set.
+    do {
+        apply_negative_resource_affinity($separate, $target_nodes);
+    } while (keys %$target_nodes < 1 && ($target_nodes = shift @$priority_classes));
+    $target_nodes = {} if !defined($target_nodes);

     delete $target_nodes->{$current_leader_node};
@@ -354,7 +363,15 @@ sub select_service_node {
         }
     }

-    apply_negative_resource_affinity($separate, $pri_nodes);
+    # do not consider nodes where HA resources from a possible negative resource
+    # affinity rule are running on.
+    # as such a negative resource affinity could end up emptying the current
+    # priority class, try the succeeding priority classes which result in a
+    # non-empty node set or else end up with an empty set.
+    do {
+        apply_negative_resource_affinity($separate, $pri_nodes);
+    } while (keys %$pri_nodes < 1 && ($pri_nodes = shift @$priority_classes));
+    $pri_nodes = {} if !defined($pri_nodes);

     # fallback to the previous maintenance node if it is available again.
 #
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
index c6a11cec..e1fc0d04 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/README
@@ -4,5 +4,6 @@
 - in a non-strict node affinity rule to node2 and node3 (equal priority), and
 - in a strict negative resource affinity rule with each other.

-Tests whether the HA resource on node3 will stay there, even though node3 is
-put in maintenance mode, because it cannot find any replacement node.
+Tests whether the HA resource on node3 will correctly move to a replacement
+node, which is different from the node of the other HA resource (node2), and
+moves back to its previous maintenance node as soon as it's available again.
diff --git a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
index 8899f782..1fc25206 100644
--- a/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-maintenance-strict-negative1/log.expect
@@ -30,12 +30,25 @@ info 25 node3/lrm: service status vm:102 started
 info 120 cmdlist: execute crm node3 enable-node-maintenance
 info 125 node3/lrm: status change active => maintenance
 info 140 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
-warn 140 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 160 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 180 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
-warn 200 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 140 node1/crm: migrate service 'vm:102' to node 'node1' (running)
+info 140 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node3, target = node1)
+info 141 node1/lrm: got lock 'ha_agent_node1_lock'
+info 141 node1/lrm: status change wait_for_agent_lock => active
+info 145 node3/lrm: service vm:102 - start migrate to node 'node1'
+info 145 node3/lrm: service vm:102 - end migrate to node 'node1'
+info 160 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node1)
+info 161 node1/lrm: starting service vm:102
+info 161 node1/lrm: service status vm:102 started
 info 220 cmdlist: execute crm node3 disable-node-maintenance
-warn 220 node1/crm: service 'vm:102': cannot find a replacement node while its current node is in maintenance
+info 225 node3/lrm: got lock 'ha_agent_node3_lock'
+info 225 node3/lrm: status change maintenance => active
 info 240 node1/crm: node 'node3': state changed from 'maintenance' => 'online'
-info 240 node1/crm: service 'vm:102': clearing stale maintenance node 'node3' setting (is current node)
+info 240 node1/crm: moving service 'vm:102' back to 'node3', node came back from maintenance.
+info 240 node1/crm: migrate service 'vm:102' to node 'node3' (running)
+info 240 node1/crm: service 'vm:102': state changed from 'started' to 'migrate' (node = node1, target = node3)
+info 241 node1/lrm: service vm:102 - start migrate to node 'node3'
+info 241 node1/lrm: service vm:102 - end migrate to node 'node3'
+info 260 node1/crm: service 'vm:102': state changed from 'migrate' to 'started' (node = node3)
+info 265 node3/lrm: starting service vm:102
+info 265 node3/lrm: service status vm:102 started
 info 820 hardware: exit simulation - done
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
index 062fc665..581b0e9a 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/README
@@ -1,7 +1,8 @@
 Test whether a strict negative resource affinity rule among three resources,
-where two resources are restricted each to nodes they are not yet on, can be
-exchanged to the nodes described by their node affinity rules, if one of the
-resources is stopped.
+where all resources are restricted each to nodes they are not yet on, can be
+exchanged to the nodes described by their node affinity rules or fallback to
+another valid configuration within the semantics of non-strict node affinity
+rules, if one of the resources is stopped.

 The test scenario is:
 - vm:101, vm:102, and vm:103 should be on node2, node3 or node1 respectively
diff --git a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
index 1ed34c36..66974583 100644
--- a/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
+++ b/src/test/test-resource-affinity-with-node-affinity-strict-negative3/log.expect
@@ -57,11 +57,13 @@ info 285 node3/lrm: service status vm:102 started
 info 320 cmdlist: execute service vm:101 started
 info 320 node1/crm: service 'vm:101': state changed from 'stopped' to 'request_start' (node = node1)
 info 320 node1/crm: service 'vm:101': state changed from 'request_start' to 'started' (node = node1)
-info 320 node1/crm: migrate service 'vm:101' to node 'node2' (running)
-info 320 node1/crm: service 'vm:101': state changed from 'started' to 'migrate' (node = node1, target = node2)
-info 321 node1/lrm: service vm:101 - start migrate to node 'node2'
-info 321 node1/lrm: service vm:101 - end migrate to node 'node2'
-info 340 node1/crm: service 'vm:101': state changed from 'migrate' to 'started' (node = node2)
-info 343 node2/lrm: starting service vm:101
-info 343 node2/lrm: service status vm:101 started
+info 320 node1/crm: migrate service 'vm:103' to node 'node2' (running)
+info 320 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
+info 321 node1/lrm: starting service vm:101
+info 321 node1/lrm: service status vm:101 started
+info 321 node1/lrm: service vm:103 - start migrate to node 'node2'
+info 321 node1/lrm: service vm:103 - end migrate to node 'node2'
+info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node2)
+info 343 node2/lrm: starting service vm:103
+info 343 node2/lrm: service status vm:103 started
 info 920 hardware: exit simulation - done
-- 
2.47.3