From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Mon, 12 Jun 2023 17:27:11 +0200
Message-Id: <20230612152711.97078-2-f.ebner@proxmox.com>
In-Reply-To: <20230612152711.97078-1-f.ebner@proxmox.com>
References: <20230612152711.97078-1-f.ebner@proxmox.com>
Subject: [pve-devel] [PATCH v2 ha-manager 2/2] manager: clear stale maintenance node caused by simultaneous cluster shutdown

Currently, the maintenance node for a service is only cleared when the
service is started on another node. In the edge case of a simultaneous
cluster shutdown, however, the service might never have been started
anywhere else after the maintenance node was recorded, because the other
nodes were already in the process of being shut down too.

If a user ends up in this edge case, it would be rather surprising that
the service is automatically migrated back to the "maintenance node",
which is actually not in maintenance mode anymore, after a migration away
from it.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
    * Rebase on current master.
    * Split out introducing the test into a dedicated patch.

 src/PVE/HA/Manager.pm                      | 18 ++++++++++++++++++
 .../test-stale-maintenance-node/log.expect | 11 +++--------
 2 files changed, 21 insertions(+), 8 deletions(-)
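For illustration, here is a minimal standalone sketch (not part of the
patch) of the decision the new check makes. It uses a hypothetical helper
and a plain hash in place of the real service data; in the actual code the
node state comes from $ns->get_node_state($sd->{node}) and the clearing is
the delete shown in the Manager.pm hunk below.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper mirroring the condition added in next_state_started():
    # the recorded maintenance node is considered stale only if the service
    # still sits on that very node (it was never started elsewhere since the
    # node entered maintenance mode) and that node is online again.
    sub should_clear_stale_maintenance_node {
        my ($sd, $node_state) = @_;

        return 0 if !$sd->{maintenance_node};
        return 0 if $sd->{node} ne $sd->{maintenance_node};

        return $node_state eq 'online' ? 1 : 0;
    }

    # Example: vm:103 is still on node1, which left maintenance mode again
    # after the simultaneous cluster shutdown.
    my $sd = { node => 'node1', maintenance_node => 'node1' };
    print should_clear_stale_maintenance_node($sd, 'online'), "\n"; # prints 1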
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 7cfc402..6a13360 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -917,6 +917,24 @@ sub next_state_started {
                 )
             );
         }
+
+        if ($sd->{maintenance_node} && $sd->{node} eq $sd->{maintenance_node}) {
+            my $node_state = $ns->get_node_state($sd->{node});
+            if ($node_state eq 'online') {
+                # Having the maintenance node set here means that the service was never
+                # started on a different node since it was set. This can happen in the edge
+                # case that the whole cluster is shut down at the same time while the
+                # 'migrate' policy was configured. Node is not in maintenance mode anymore
+                # and service is started on this node, so it's fine to clear the setting.
+                $haenv->log(
+                    'info',
+                    "service '$sid': clearing stale maintenance node "
+                        ."'$sd->{maintenance_node}' setting (is current node)",
+                );
+                delete $sd->{maintenance_node};
+            }
+        }
+
         # ensure service get started again if it went unexpected down
         # but ensure also no LRM result gets lost
         $sd->{uid} = compute_new_uuid($sd->{state}) if defined($lrm_res);
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
index ac944ca..63f12bd 100644
--- a/src/test/test-stale-maintenance-node/log.expect
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -72,6 +72,7 @@ info 224 node3/crm: status change wait_for_quorum => slave
 info 240 node1/crm: node 'node1': state changed from 'maintenance' => 'online'
 info 240 node1/crm: node 'node2': state changed from 'maintenance' => 'online'
 info 240 node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info 240 node1/crm: service 'vm:103': clearing stale maintenance node 'node1' setting (is current node)
 info 320 cmdlist: execute service vm:103 migrate node3
 info 320 node1/crm: got crm command: migrate vm:103 node3
 info 320 node1/crm: migrate service 'vm:103' to node 'node3'
@@ -79,14 +80,8 @@ info 320 node1/crm: service 'vm:103': state changed from 'started' to 'mig
 info 321 node1/lrm: service vm:103 - start migrate to node 'node3'
 info 321 node1/lrm: service vm:103 - end migrate to node 'node3'
 info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node3)
-info 340 node1/crm: moving service 'vm:103' back to 'node1', node came back from maintenance.
-info 340 node1/crm: migrate service 'vm:103' to node 'node1' (running)
-info 340 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node3, target = node1)
 info 345 node3/lrm: got lock 'ha_agent_node3_lock'
 info 345 node3/lrm: status change wait_for_agent_lock => active
-info 345 node3/lrm: service vm:103 - start migrate to node 'node1'
-info 345 node3/lrm: service vm:103 - end migrate to node 'node1'
-info 360 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node1)
-info 361 node1/lrm: starting service vm:103
-info 361 node1/lrm: service status vm:103 started
+info 345 node3/lrm: starting service vm:103
+info 345 node3/lrm: service status vm:103 started
 info 920 hardware: exit simulation - done
-- 
2.39.2