public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state
@ 2021-10-08 12:52 Fabian Ebner
  2022-01-19 13:36 ` [pve-devel] applied: " Thomas Lamprecht
From: Fabian Ebner @ 2021-10-08 12:52 UTC
  To: pve-devel

If all services in 'fence' state are gone from a node (e.g. by
removing the services) before fence_node() was successful, a node
would get stuck in the 'fence' state. Avoid this by calling
fence_node() if the node is in 'fence' state, regardless of service
state.

Reported in the community forum:
https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Not really sure if this is worth it, because it's a hard-to-reach edge
case, but AFAICT there is no good way to get out of being stuck. What
would work is either of:
    * Manually correcting the node state.
    * Adding a service to the stuck node and triggering a fence
      situation.

An alternative would be to keep services in 'fence' state in the
manager state, even if they were removed from the config. But the
approach from this patch seemed a bit more robust: for example, it
will fix an already existing stuck state, rather than just avoid
creating one.

 src/PVE/HA/Manager.pm | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1c66b43..fc445b1 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -472,6 +472,14 @@ sub manage {
 	    $repeat = 1; # for faster execution
 	}
 
+	# Avoid that a node without services in 'fence' state gets stuck in 'fence' state.
+	for my $node (sort keys $ns->{status}->%*) {
+	    next if $ns->get_node_state($node) ne 'fence';
+	    next if defined($fenced_nodes->{$node});
+
+	    $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
+	}
+
 	last if !$repeat;
     }
 
-- 
2.30.2
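
The effect of the added loop can be illustrated with a small standalone
sketch. Everything here is a simplified assumption for illustration only:
MockNodeStatus is a hypothetical stand-in (not the real node status class
from pve-ha-manager), manage_pass() is a hypothetical reduction of one
manage() iteration, and the mock fence_node() is assumed to always succeed
and move the node to 'unknown'. Only the second loop mirrors the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the node status object (illustration only).
package MockNodeStatus;

sub new {
    my ($class, $status) = @_;
    return bless { status => { %$status }, fence_calls => 0 }, $class;
}

sub get_node_state {
    my ($self, $node) = @_;
    return $self->{status}->{$node};
}

# Assumption: fencing always succeeds and the node becomes 'unknown'.
sub fence_node {
    my ($self, $node) = @_;
    $self->{fence_calls}++;
    $self->{status}->{$node} = 'unknown';
    return 1;
}

package main;

# Hypothetical, heavily simplified single pass of the manage loop.
sub manage_pass {
    my ($ns, $services, $with_fix) = @_;
    my $fenced_nodes = {};

    # Service-driven fencing: only reached via services in 'fence' state.
    for my $sid (sort keys %$services) {
        my $sd = $services->{$sid};
        next if $sd->{state} ne 'fence';
        my $node = $sd->{node};
        $fenced_nodes->{$node} = $ns->fence_node($node) || 0
            if !defined($fenced_nodes->{$node});
    }

    return $fenced_nodes if !$with_fix;

    # Node-driven fencing, as in the patch: independent of service state.
    for my $node (sort keys %{$ns->{status}}) {
        next if $ns->get_node_state($node) ne 'fence';
        next if defined($fenced_nodes->{$node});
        $fenced_nodes->{$node} = $ns->fence_node($node) || 0;
    }

    return $fenced_nodes;
}

# All services were removed while node2 was still waiting to be fenced.
my $services = {};

my $old = MockNodeStatus->new({ node1 => 'online', node2 => 'fence' });
manage_pass($old, $services, 0);
print "without fix: ", $old->get_node_state('node2'), "\n";

my $new = MockNodeStatus->new({ node1 => 'online', node2 => 'fence' });
manage_pass($new, $services, 1);
print "with fix: ", $new->get_node_state('node2'), "\n";
```

In this sketch, the pass without the extra loop never calls fence_node()
because no service in 'fence' state refers to node2 anymore, so the node
stays in 'fence'. With the extra loop, the node is fenced exactly once per
pass, since $fenced_nodes guards against a second call.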






* [pve-devel] applied: [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state
  2021-10-08 12:52 [pve-devel] [RFC ha-manager] manage: handle edge case where a node gets stuck in 'fence' state Fabian Ebner
@ 2022-01-19 13:36 ` Thomas Lamprecht
From: Thomas Lamprecht @ 2022-01-19 13:36 UTC
  To: Proxmox VE development discussion, Fabian Ebner

On 08.10.21 14:52, Fabian Ebner wrote:
> If all services in 'fence' state are gone from a node (e.g. by
> removing the services) before fence_node() was successful, a node
> would get stuck in the 'fence' state. Avoid this by calling
> fence_node() if the node is in 'fence' state, regardless of service
> state.
> 
> Reported in the community forum:
> https://forum.proxmox.com/threads/ha-migration-stuck-is-doing-nothing.94469/
> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> Not really sure if this is worth it, because it's a hard-to-reach edge
> case, but AFAICT there is no good way to get out of being stuck. What
> would work is either of:
>     * Manually correcting the node state.
>     * Adding a service to the stuck node and triggering a fence
>       situation.
> 
> An alternative would be to keep services in 'fence' state in the
> manager state, even if they were removed from the config. But the
> approach from this patch seemed a bit more robust: for example, it
> will fix an already existing stuck state, rather than just avoid
> creating one.
> 
>  src/PVE/HA/Manager.pm | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
>

applied, thanks!

As also discussed off-list, I noticed a related issue in a derived edge case
that could cause trouble too. I spent some time coming up with two tests
covering your fixed situation plus mine, slightly expanding the capabilities
of the test/simulation system.

https://git.proxmox.com/?p=pve-ha-manager.git;a=commit;h=ca2e547a7662467f9a08c54fa15b46825e3702e6
https://git.proxmox.com/?p=pve-ha-manager.git;a=commit;h=30fc7ceedb7f3047659f22d063cc16c94c20dd7a



