From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH v2 ha-manager 2/2] manager: clear stale maintenance node caused by simultaneous cluster shutdown
Date: Mon, 12 Jun 2023 17:27:11 +0200
Message-ID: <20230612152711.97078-2-f.ebner@proxmox.com>
In-Reply-To: <20230612152711.97078-1-f.ebner@proxmox.com>

Currently, the maintenance node for a service is only cleared when the
service is started on another node. In the edge case of a simultaneous
cluster shutdown, however, it might be that the service was never
started anywhere else after the maintenance node was recorded, because
the other nodes were already in the process of shutting down too.

If a user ends up in this edge case, it would be rather surprising
that the service would be automatically migrated back to the
"maintenance node" after a later migration away from it, even though
that node is not actually in maintenance mode anymore.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Changes in v2:
    * Rebase on current master.
    * Split out introducing the test into a dedicated patch.

 src/PVE/HA/Manager.pm                          | 18 ++++++++++++++++++
 .../test-stale-maintenance-node/log.expect     | 11 +++--------
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 7cfc402..6a13360 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -917,6 +917,24 @@ sub next_state_started {
 			)
 		    );
 		}
+
+		if ($sd->{maintenance_node} && $sd->{node} eq $sd->{maintenance_node}) {
+		    my $node_state = $ns->get_node_state($sd->{node});
+		    if ($node_state eq 'online') {
+			# Having the maintenance node set here means that the service was never
+			# started on a different node since it was set. This can happen in the edge
+			# case that the whole cluster is shut down at the same time while the
+			# 'migrate' policy was configured. Node is not in maintenance mode anymore
+			# and service is started on this node, so it's fine to clear the setting.
+			$haenv->log(
+			    'info',
+			    "service '$sid': clearing stale maintenance node "
+				."'$sd->{maintenance_node}' setting (is current node)",
+			);
+			delete $sd->{maintenance_node};
+		    }
+		}
+
 		# ensure service get started again if it went unexpected down
 		# but ensure also no LRM result gets lost
 		$sd->{uid} = compute_new_uuid($sd->{state}) if defined($lrm_res);
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index ac944ca..63f12bd 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -72,6 +72,7 @@ info    224    node3/crm: status change wait_for_quorum => slave
 info    240    node1/crm: node 'node1': state changed from 'maintenance' => 'online'
 info    240    node1/crm: node 'node2': state changed from 'maintenance' => 'online'
 info    240    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    240    node1/crm: service 'vm:103': clearing stale maintenance node 'node1' setting (is current node)
 info    320      cmdlist: execute service vm:103 migrate node3
 info    320    node1/crm: got crm command: migrate vm:103 node3
 info    320    node1/crm: migrate service 'vm:103' to node 'node3'
@@ -79,14 +80,8 @@ info    320    node1/crm: service 'vm:103': state changed from 'started' to 'mig
 info    321    node1/lrm: service vm:103 - start migrate to node 'node3'
 info    321    node1/lrm: service vm:103 - end migrate to node 'node3'
 info    340    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
-info    340    node1/crm: moving service 'vm:103' back to 'node1', node came back from maintenance.
-info    340    node1/crm: migrate service 'vm:103' to node 'node1' (running)
-info    340    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
 info    345    node3/lrm: got lock 'ha_agent_node3_lock'
 info    345    node3/lrm: status change wait_for_agent_lock => active
-info    345    node3/lrm: service vm:103 - start migrate to node 'node1'
-info    345    node3/lrm: service vm:103 - end migrate to node 'node1'
-info    360    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
-info    361    node1/lrm: starting service vm:103
-info    361    node1/lrm: service status vm:103 started
+info    345    node3/lrm: starting service vm:103
+info    345    node3/lrm: service status vm:103 started
 info    920     hardware: exit simulation - done
-- 
2.39.2

Thread overview: 4+ messages
2023-06-12 15:27 [pve-devel] [PATCH v2 ha-manager 1/2] tests: simulate stale maintainance " Fiona Ebner
2023-06-12 15:27 ` Fiona Ebner [this message]
2023-06-13  7:00   ` [pve-devel] applied: [PATCH v2 ha-manager 2/2] manager: clear stale maintenance " Thomas Lamprecht
2023-06-13  7:00 ` [pve-devel] applied: [PATCH v2 ha-manager 1/2] tests: simulate stale maintainance " Thomas Lamprecht
