From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 1/7] manager: warn if HA resources cannot be moved away from maintenance node
Date: Wed, 22 Apr 2026 12:00:19 +0200
Message-ID: <20260422100035.232716-2-d.kral@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
References: <20260422100035.232716-1-d.kral@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Proxmox VE development discussion

There are scenarios where an HA resource cannot be moved away from its
current node while that node is in maintenance mode. Previously, this could
only happen in an edge case where the whole cluster was shut down at the
same time while the 'migrate' policy was configured, but with affinity
rules it is much easier to run into such scenarios.

While some of these affinity-related scenarios need to be resolved in a
better way, admins should always be warned about such a situation.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/Manager.pm                          | 20 +++++++++++++++----
 .../test-stale-maintenance-node/log.expect     |  3 +++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bba..684244e1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1562,16 +1562,28 @@ sub next_state_started {
         my $node_state = $ns->get_node_state($sd->{node});
 
         if ($node_state eq 'online') {
             # Having the maintenance node set here means that the service was never
-            # started on a different node since it was set. This can happen in the edge
-            # case that the whole cluster is shut down at the same time while the
-            # 'migrate' policy was configured. Node is not in maintenance mode anymore
-            # and service is started on this node, so it's fine to clear the setting.
+            # started on a different node since it was set.
+            #
+            # This can happen if:
+            # - select_service_node(...) could not find any replacement node for the
+            #   service while its current node was in maintenance mode, or
+            # - the whole cluster was shut down at the same time while the 'migrate'
+            #   policy was configured.
+            #
+            # Node is not in maintenance mode anymore and service is started on this
+            # node, so it's fine to clear the setting.
             $haenv->log(
                 'info',
                 "service '$sid': clearing stale maintenance node "
                     . "'$sd->{maintenance_node}' setting (is current node)",
             );
             delete $sd->{maintenance_node};
+        } else {
+            $haenv->log(
+                'warning',
+                "service '$sid': cannot find a replacement node while"
+                    . " its current node is in maintenance",
+            );
         }
     }
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index 092db8be..fce96fd4 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -33,6 +33,7 @@ info 120 node3/lrm: shutdown LRM, doing maintenance, removing this node fr
 info 120 node1/crm: node 'node1': state changed from 'online' => 'maintenance'
 info 120 node1/crm: node 'node2': state changed from 'online' => 'maintenance'
 info 120 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+warn 120 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 121 node1/lrm: status change active => maintenance
 info 124 node2/lrm: exit (loop end)
 info 124 shutdown: execute crm node2 stop
@@ -40,6 +41,7 @@ info 123 node2/crm: server received shutdown request
 info 126 node3/lrm: exit (loop end)
 info 126 shutdown: execute crm node3 stop
 info 125 node3/crm: server received shutdown request
+warn 140 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 143 node2/crm: exit (loop end)
 info 143 shutdown: execute power node2 off
 info 144 node3/crm: exit (loop end)
@@ -64,6 +66,7 @@ info 220 cmdlist: execute power node3 on
 info 220 node3/crm: status change startup => wait_for_quorum
 info 220 node3/lrm: status change startup => wait_for_agent_lock
 info 220 node1/crm: status change wait_for_quorum => master
+warn 220 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
 info 221 node1/lrm: status change wait_for_agent_lock => active
 info 221 node1/lrm: starting service vm:103
 info 221 node1/lrm: service status vm:103 started
-- 
2.47.3