public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH ha-manager] lrm: fix getting stuck on restart
From: Fabian Grünbichler @ 2022-04-27 10:19 UTC
  To: pve-devel

run_workers is responsible for updating the state after workers have
exited. if the current LRM state is 'active', but a shutdown_request was
issued in 'restart' mode (like on package upgrades), this call is the
only one made in the LRM work() loop.

skipping it if there are active services means the following sequence of
events effectively keeps the LRM from restarting or making any progress:

- start HA migration on node A
- reload LRM on node A while migration is still running

even once the migration has finished, the service count stays >= 1:
the LRM never calls run_workers (directly or via manage_resources), so
it never notices that the service has been migrated away.

maintenance mode (i.e., rebooting the node with shutdown policy migrate)
does call manage_resources and thus run_workers, and will proceed once
the last worker has exited.
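
the stuck sequence can be reproduced with a small stand-alone model. this
is only a sketch under loud assumptions: new_lrm, tick, simulate and the
restart_step_* helpers are hypothetical names, and the toy run_workers /
active_service_count below merely mimic the behaviour described in this
message (run_workers being the only place that updates state after a
worker exits). it is not the real PVE::HA::LRM code.

```perl
use strict;
use warnings;

# Toy model, NOT the real PVE::HA::LRM: one service with one migration
# worker that needs three ticks to finish.
sub new_lrm {
    return {
        services => { 'vm:100' => 1 },
        workers  => [ { sid => 'vm:100', ticks_left => 3 } ],
    };
}

# Time passes for the workers whether or not the LRM looks at them.
sub tick {
    my ($lrm) = @_;
    $_->{ticks_left}-- for @{ $lrm->{workers} };
}

# Reap exited workers, update the service state for each of them and
# return the number of workers that are still running.
sub run_workers {
    my ($lrm) = @_;
    my @running;
    for my $w (@{ $lrm->{workers} }) {
        if ($w->{ticks_left} <= 0) {
            delete $lrm->{services}->{ $w->{sid} };    # migrated away
        } else {
            push @running, $w;
        }
    }
    $lrm->{workers} = \@running;
    return scalar(@running);
}

sub active_service_count {
    my ($lrm) = @_;
    return scalar(keys %{ $lrm->{services} });
}

# Pre-patch order: run_workers is only reached once the count is already 0.
sub restart_step_old {
    my ($lrm) = @_;
    if (active_service_count($lrm) == 0) {
        return 1 if run_workers($lrm) == 0;
    }
    return 0;
}

# Post-patch order: always reap workers first, then check both counters.
sub restart_step_new {
    my ($lrm) = @_;
    my $workers       = run_workers($lrm);
    my $service_count = active_service_count($lrm);
    return ($service_count == 0 && $workers == 0) ? 1 : 0;
}

# Run one restart-mode step per round; return the round in which the
# restart may proceed, or undef if it never does within $max_rounds.
sub simulate {
    my ($step, $max_rounds) = @_;
    my $lrm = new_lrm();
    for my $round (1 .. $max_rounds) {
        tick($lrm);
        return $round if $step->($lrm);
    }
    return undef;
}

my $old = simulate(\&restart_step_old, 10);
my $new = simulate(\&restart_step_new, 10);
printf "old: %s\n", defined($old) ? "restart in round $old" : "still stuck";
printf "new: %s\n", defined($new) ? "restart in round $new" : "still stuck";
```

in this model the old ordering never gets past the service count check, no
matter how many rounds are allowed, while the new ordering proceeds in the
round in which the worker exits.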

reported by a user:

https://forum.proxmox.com/threads/lrm-hangs-when-updating-while-migration-is-running.108628

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
better viewed with -w ;)

 src/PVE/HA/LRM.pm | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 7e635e6..8cbdb82 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -420,18 +420,17 @@ sub work {
 	    if ($self->{shutdown_request}) {
 
 		if ($self->{mode} eq 'restart') {
-
+		    # catch exited workers to update service state
+		    my $workers = $self->run_workers();
 		    my $service_count = $self->active_service_count();
 
-		    if ($service_count == 0) {
-			if ($self->run_workers() == 0) {
-			    # safety: no active services or workers -> OK
-			    give_up_watchdog_protection($self);
-			    $shutdown = 1;
+		    if ($service_count == 0 && $workers == 0) {
+			# safety: no active services or workers -> OK
+			give_up_watchdog_protection($self);
+			$shutdown = 1;
 
-			    # restart with no or freezed services, release the lock
-			    $haenv->release_ha_agent_lock();
-			}
+			# restart with no or freezed services, release the lock
+			$haenv->release_ha_agent_lock();
 		    }
 		} else {
 
-- 
2.30.2






* [pve-devel] applied: [PATCH ha-manager] lrm: fix getting stuck on restart
From: Thomas Lamprecht @ 2022-04-27 12:00 UTC
  To: Proxmox VE development discussion, Fabian Grünbichler

good fix!

applied, thanks!



