From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 1/7] manager: warn if HA resources cannot be moved away from maintenance node
Date: Wed, 22 Apr 2026 12:00:19 +0200
Message-ID: <20260422100035.232716-2-d.kral@proxmox.com>
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
There are scenarios in which an HA resource cannot be moved away from its
current node while that node is in maintenance mode.
Previously, this could only happen in the edge case where the whole
cluster was shut down at the same time while the 'migrate' policy was
configured, but with affinity rules it is much easier to run into such a
scenario.
While some of these affinity-related scenarios need to be resolved in a
better way, admins should always be warned about such a situation.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 20 +++++++++++++++----
.../test-stale-maintenance-node/log.expect | 3 +++
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bba..684244e1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1562,16 +1562,28 @@ sub next_state_started {
my $node_state = $ns->get_node_state($sd->{node});
if ($node_state eq 'online') {
# Having the maintenance node set here means that the service was never
- # started on a different node since it was set. This can happen in the edge
- # case that the whole cluster is shut down at the same time while the
- # 'migrate' policy was configured. Node is not in maintenance mode anymore
- # and service is started on this node, so it's fine to clear the setting.
+ # started on a different node since it was set.
+ #
+ # This can happen if:
+ # - select_service_node(...) could not find any replacement node for the
+ # service while its current node was in maintenance mode, or
+ # - the whole cluster was shut down at the same time while the 'migrate'
+ # policy was configured.
+ #
+ # Node is not in maintenance mode anymore and service is started on this
+ # node, so it's fine to clear the setting.
$haenv->log(
'info',
"service '$sid': clearing stale maintenance node "
. "'$sd->{maintenance_node}' setting (is current node)",
);
delete $sd->{maintenance_node};
+ } else {
+ $haenv->log(
+ 'warning',
+ "service '$sid': cannot find a replacement node while"
+ . " its current node is in maintenance",
+ );
}
}
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index 092db8be..fce96fd4 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -33,6 +33,7 @@ info 120 node3/lrm: shutdown LRM, doing maintenance, removing this node fr
info 120 node1/crm: node 'node1': state changed from 'online' => 'maintenance'
info 120 node1/crm: node 'node2': state changed from 'online' => 'maintenance'
info 120 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+warn 120 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 121 node1/lrm: status change active => maintenance
info 124 node2/lrm: exit (loop end)
info 124 shutdown: execute crm node2 stop
@@ -40,6 +41,7 @@ info 123 node2/crm: server received shutdown request
info 126 node3/lrm: exit (loop end)
info 126 shutdown: execute crm node3 stop
info 125 node3/crm: server received shutdown request
+warn 140 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 143 node2/crm: exit (loop end)
info 143 shutdown: execute power node2 off
info 144 node3/crm: exit (loop end)
@@ -64,6 +66,7 @@ info 220 cmdlist: execute power node3 on
info 220 node3/crm: status change startup => wait_for_quorum
info 220 node3/lrm: status change startup => wait_for_agent_lock
info 220 node1/crm: status change wait_for_quorum => master
+warn 220 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 221 node1/lrm: status change wait_for_agent_lock => active
info 221 node1/lrm: starting service vm:103
info 221 node1/lrm: service status vm:103 started
--
2.47.3
2026-04-22 10:00 [PATCH-SERIES ha-manager 0/7] improve handling of maintenance nodes Daniel Kral
2026-04-22 10:00 ` Daniel Kral [this message]
2026-04-22 10:00 ` [PATCH ha-manager 2/7] test: add test cases for node affinity rules with maintenance mode Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 3/7] test: add test cases for resource " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 4/7] manager: make HA resources without failback move back to maintenance node Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 5/7] manager: make HA resource bundles " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 6/7] make get_node_affinity return all priority classes sorted in descending order Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity Daniel Kral