From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 1/7] manager: warn if HA resources cannot be moved away from maintenance node
Date: Wed, 22 Apr 2026 12:00:19 +0200
Message-ID: <20260422100035.232716-2-d.kral@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
References: <20260422100035.232716-1-d.kral@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Proxmox VE development discussion

There are scenarios where an HA resource cannot be moved away from its
current node while that node is in maintenance mode. Previously, this could
only happen in an edge case where the whole cluster was shut down at the
same time while the 'migrate' policy was configured, but with affinity
rules it is much easier to run into such scenarios.

While some of these affinity-related scenarios need to be resolved in a
better way, admins should always be warned about such a situation.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/Manager.pm                          | 20 +++++++++++++++----
 .../test-stale-maintenance-node/log.expect     |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bba..684244e1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1562,16 +1562,28 @@ sub next_state_started {
         my $node_state = $ns->get_node_state($sd->{node});
 
         if ($node_state eq 'online') {
             # Having the maintenance node set here means that the service was never
-            # started on a different node since it was set. This can happen in the edge
-            # case that the whole cluster is shut down at the same time while the
-            # 'migrate' policy was configured. Node is not in maintenance mode anymore
-            # and service is started on this node, so it's fine to clear the setting.
+            # started on a different node since it was set.
+            #
+            # This can happen if:
+            # - select_service_node(...) could not find any replacement node for the
+            #   service while its current node was in maintenance mode, or
+            # - the whole cluster was shut down at the same time while the 'migrate'
+            #   policy was configured.
+            #
+            # Node is not in maintenance mode anymore and service is started on this
+            # node, so it's fine to clear the setting.
             $haenv->log(
                 'info',
                 "service '$sid': clearing stale maintenance node "
                     . "'$sd->{maintenance_node}' setting (is current node)",
             );
             delete $sd->{maintenance_node};
+        } else {
+            $haenv->log(
+                'warning',
+                "service '$sid': cannot find a replacement node while"
+                    . " its current node is in maintenance",
+            );
         }
     }
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index 092db8be..fce96fd4 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -33,6 +33,7 @@ info 120 node3/lrm: shutdown LRM, doing maintenance, removing this node fr
 info 120 node1/crm: node 'node1': state changed from 'online' => 'maintenance'
 info 120 node1/crm: node 'node2': state changed from 'online' => 'maintenance'
 info 120 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+warn 120 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 121 node1/lrm: status change active => maintenance
 info 124 node2/lrm: exit (loop end)
 info 124 shutdown: execute crm node2 stop
@@ -40,6 +41,7 @@ info 123 node2/crm: server received shutdown request
 info 126 node3/lrm: exit (loop end)
 info 126 shutdown: execute crm node3 stop
 info 125 node3/crm: server received shutdown request
+warn 140 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 143 node2/crm: exit (loop end)
 info 143 shutdown: execute power node2 off
 info 144 node3/crm: exit (loop end)
@@ -64,6 +66,7 @@ info 220 cmdlist: execute power node3 on
 info 220 node3/crm: status change startup => wait_for_quorum
 info 220 node3/lrm: status change startup => wait_for_agent_lock
 info 220 node1/crm: status change wait_for_quorum => master
+warn 220 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 221 node1/lrm: status change wait_for_agent_lock => active
 info 221 node1/lrm: starting service vm:103
 info 221 node1/lrm: service status vm:103 started
-- 
2.47.3