From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 1/7] manager: warn if HA resources cannot be moved away from maintenance node
Date: Wed, 22 Apr 2026 12:00:19 +0200
Message-ID: <20260422100035.232716-2-d.kral@proxmox.com>
In-Reply-To: <20260422100035.232716-1-d.kral@proxmox.com>
There are scenarios in which an HA resource cannot be moved away from its
current node while that node is in maintenance mode.
Previously, this could only happen in the edge case where the whole
cluster was shut down at the same time while the 'migrate' policy was
configured, but with affinity rules it is much easier to run into such a
scenario.
While some of these affinity-related scenarios need to be resolved in a
better way, admins should always be warned about such a situation.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/Manager.pm | 20 +++++++++++++++----
.../test-stale-maintenance-node/log.expect | 3 +++
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b69a6bba..684244e1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1562,16 +1562,28 @@ sub next_state_started {
my $node_state = $ns->get_node_state($sd->{node});
if ($node_state eq 'online') {
# Having the maintenance node set here means that the service was never
- # started on a different node since it was set. This can happen in the edge
- # case that the whole cluster is shut down at the same time while the
- # 'migrate' policy was configured. Node is not in maintenance mode anymore
- # and service is started on this node, so it's fine to clear the setting.
+ # started on a different node since it was set.
+ #
+ # This can happen if:
+ # - select_service_node(...) could not find any replacement node for the
+ # service while its current node was in maintenance mode, or
+ # - the whole cluster was shut down at the same time while the 'migrate'
+ # policy was configured.
+ #
+ # Node is not in maintenance mode anymore and service is started on this
+ # node, so it's fine to clear the setting.
$haenv->log(
'info',
"service '$sid': clearing stale maintenance node "
. "'$sd->{maintenance_node}' setting (is current node)",
);
delete $sd->{maintenance_node};
+ } else {
+ $haenv->log(
+ 'warning',
+ "service '$sid': cannot find a replacement node while"
+ . " its current node is in maintenance",
+ );
}
}
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index 092db8be..fce96fd4 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -33,6 +33,7 @@ info 120 node3/lrm: shutdown LRM, doing maintenance, removing this node fr
info 120 node1/crm: node 'node1': state changed from 'online' => 'maintenance'
info 120 node1/crm: node 'node2': state changed from 'online' => 'maintenance'
info 120 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+warn 120 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 121 node1/lrm: status change active => maintenance
info 124 node2/lrm: exit (loop end)
info 124 shutdown: execute crm node2 stop
@@ -40,6 +41,7 @@ info 123 node2/crm: server received shutdown request
info 126 node3/lrm: exit (loop end)
info 126 shutdown: execute crm node3 stop
info 125 node3/crm: server received shutdown request
+warn 140 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 143 node2/crm: exit (loop end)
info 143 shutdown: execute power node2 off
info 144 node3/crm: exit (loop end)
@@ -64,6 +66,7 @@ info 220 cmdlist: execute power node3 on
info 220 node3/crm: status change startup => wait_for_quorum
info 220 node3/lrm: status change startup => wait_for_agent_lock
info 220 node1/crm: status change wait_for_quorum => master
+warn 220 node1/crm: service 'vm:103': cannot find a replacement node while its current node is in maintenance
info 221 node1/lrm: status change wait_for_agent_lock => active
info 221 node1/lrm: starting service vm:103
info 221 node1/lrm: service status vm:103 started
--
2.47.3
2026-04-22 10:00 [PATCH-SERIES ha-manager 0/7] improve handling of maintenance nodes Daniel Kral
2026-04-22 10:00 ` Daniel Kral [this message]
2026-04-22 10:00 ` [PATCH ha-manager 2/7] test: add test cases for node affinity rules with maintenance mode Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 3/7] test: add test cases for resource " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 4/7] manager: make HA resources without failback move back to maintenance node Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 5/7] manager: make HA resource bundles " Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 6/7] make get_node_affinity return all priority classes sorted in descending order Daniel Kral
2026-04-22 10:00 ` [PATCH ha-manager 7/7] manager: try multiple priority classes when applying negative resource affinity Daniel Kral