From: Fabian Grünbichler
To: pve-devel@lists.proxmox.com
Date: Wed, 27 Apr 2022 12:19:55 +0200
Message-Id: <20220427101955.3550677-1-f.gruenbichler@proxmox.com>
Subject: [pve-devel] [PATCH ha-manager] lrm: fix getting stuck on restart
List-Id: Proxmox VE development discussion

run_workers is responsible for updating the state after workers have
exited. if the current LRM state is 'active', but a shutdown_request was
issued in 'restart' mode (like on package upgrades), this call is the only
one made in the LRM work() loop.

skipping it if there are active services means the following sequence of
events effectively keeps the LRM from restarting or making any progress:

- start HA migration on node A
- reload LRM on node A while migration is still running

even once the migration is finished, the service count is still >= 1 since
the LRM never calls run_workers (directly or via manage_resources), so the
service having been migrated is never noticed.

maintenance mode (i.e., rebooting the node with shutdown policy migrate)
does call manage_resources and thus run_workers, and will proceed once the
last worker has exited.

reported by a user:

https://forum.proxmox.com/threads/lrm-hangs-when-updating-while-migration-is-running.108628

Signed-off-by: Fabian Grünbichler
---

better viewed with -w ;)

 src/PVE/HA/LRM.pm | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 7e635e6..8cbdb82 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -420,18 +420,17 @@ sub work {
     if ($self->{shutdown_request}) {
 
         if ($self->{mode} eq 'restart') {
-
+            # catch exited workers to update service state
+            my $workers = $self->run_workers();
             my $service_count = $self->active_service_count();
 
-            if ($service_count == 0) {
-                if ($self->run_workers() == 0) {
-                    # safety: no active services or workers -> OK
-                    give_up_watchdog_protection($self);
-                    $shutdown = 1;
+            if ($service_count == 0 && $workers == 0) {
+                # safety: no active services or workers -> OK
+                give_up_watchdog_protection($self);
+                $shutdown = 1;
 
-                    # restart with no or freezed services, release the lock
-                    $haenv->release_ha_agent_lock();
-                }
+                # restart with no or freezed services, release the lock
+                $haenv->release_ha_agent_lock();
             }
         } else {
 
-- 
2.30.2
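
for illustration only, the stuck sequence described above can be reproduced with a small standalone toy model. this is a sketch, not the actual LRM code: the ToyLRM package, its fields, tick() and the two restart_step_* helpers are hypothetical, and the hand-off of the worker result to the service state is collapsed into a single flag. only the ordering of run_workers and active_service_count mirrors the old and new restart branch.

```perl
use strict;
use warnings;

# Toy stand-in for the LRM; every name and field here is hypothetical.
package ToyLRM;

sub new {
    my ($class, $migration_ticks) = @_;
    return bless {
        worker_ticks   => $migration_ticks, # rounds until the migration worker exits
        worker_running => 1,                # worker not yet reaped
        service_active => 1,                # service still counted on this node
    }, $class;
}

# The worker process makes progress whether or not the LRM looks at it.
sub tick { $_[0]->{worker_ticks}-- }

# Reap the worker once it has exited and update the service state.
# Returns the number of workers still running.
sub run_workers {
    my ($self) = @_;
    if ($self->{worker_running} && $self->{worker_ticks} <= 0) {
        $self->{worker_running} = 0;
        $self->{service_active} = 0; # migration done, service left the node
    }
    return $self->{worker_running} ? 1 : 0;
}

sub active_service_count {
    my ($self) = @_;
    return $self->{service_active} ? 1 : 0;
}

# Old ordering: run_workers is only reached once the count is already 0.
sub restart_step_old {
    my ($self) = @_;
    my $service_count = $self->active_service_count();
    if ($service_count == 0) {
        return 1 if $self->run_workers() == 0;
    }
    return 0;
}

# New ordering: always reap workers first, then check both counters.
sub restart_step_new {
    my ($self) = @_;
    my $workers = $self->run_workers();
    my $service_count = $self->active_service_count();
    return ($service_count == 0 && $workers == 0) ? 1 : 0;
}

package main;

# Returns the round in which shutdown becomes possible, or undef if never.
sub rounds_until_shutdown {
    my ($step, $max_rounds) = @_;
    my $lrm = ToyLRM->new(3); # migration needs three more rounds
    for my $round (1 .. $max_rounds) {
        $lrm->tick();
        return $round if $lrm->$step();
    }
    return undef;
}

my $old = rounds_until_shutdown('restart_step_old', 20);
my $new = rounds_until_shutdown('restart_step_new', 20);
printf "old: %s\n", defined($old) ? "shutdown in round $old" : "still stuck after 20 rounds";
printf "new: %s\n", defined($new) ? "shutdown in round $new" : "still stuck after 20 rounds";
```

in this model the old ordering never reaches run_workers, so the service flag is never cleared and the loop does not terminate within the round limit, while the new ordering allows the shutdown in the round in which the worker exits.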