* [PATCH docs 0/2] document disarm-ha, arm-ha and watchdog fencing status
@ 2026-03-10 15:47 Thomas Lamprecht
2026-03-10 15:47 ` [PATCH docs 1/2] ha-manager: document fencing & watchdog status Thomas Lamprecht
2026-03-10 15:47 ` [PATCH docs 2/2] ha-manager: document disarming and arming Thomas Lamprecht
0 siblings, 2 replies; 3+ messages in thread
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
This is the documentation part for the recently sent implementation:
https://lore.proxmox.com/all/20260309220128.973793-1-t.lamprecht@proxmox.com/
Thomas Lamprecht (2):
ha-manager: document fencing & watchdog status
ha-manager: document disarming and arming
ha-manager.adoc | 155 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 155 insertions(+)
--
2.47.3
* [PATCH docs 1/2] ha-manager: document fencing & watchdog status
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
This accompanies the recent change to the ha-manager status API endpoint,
which now also reports an explicit fencing/watchdog status.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
ha-manager.adoc | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/ha-manager.adoc b/ha-manager.adoc
index 4c318fb..ee254be 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1003,6 +1003,34 @@ can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.
+[[ha_manager_fencing_status]]
+Fencing & Watchdog Status
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The `ha-manager status` output includes a fencing entry that shows the CRM
+watchdog state. Each LRM entry additionally shows its own watchdog state.
+
+armed::
+
+The CRM is actively managing services and has its watchdog open. Each node's
+LRM also holds a watchdog while it has its agent lock. On quorum loss or
+daemon failure, the respective watchdog triggers a node reset to ensure safe
+failover.
+
+standby::
+
+The HA stack is ready but no CRM is actively running as master, for example
+when no HA resources are configured yet or the cluster just started. The CRM
+watchdog is not open. Fencing automatically transitions to `armed` once a CRM
+takes over as master.
+
+NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
+open for its entire lifetime, even when no HA client is connected. This
+prevents other processes from claiming the device and ensures the HA stack can
+always re-acquire it. Not all hardware watchdog drivers support magic close, so
+closing the device could trigger an unintended reset.
+
+
[[ha_manager_start_failure_policy]]
Start Failure Policy
---------------------
--
2.47.3
* [PATCH docs 2/2] ha-manager: document disarming and arming
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
Add a new section to document the new disarm-ha and arm-ha commands
and how they interact with in-progress fencing, offline nodes, and
maintenance mode.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
ha-manager.adoc | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
diff --git a/ha-manager.adoc b/ha-manager.adoc
index ee254be..5547f7c 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1024,6 +1024,19 @@ when no HA resources are configured yet or the cluster just started. The CRM
watchdog is not open. Fencing automatically transitions to `armed` once a CRM
takes over as master.
+disarming::
+
+A `disarm-ha` command was issued. The CRM is freezing services or removing
+them from tracking and waiting for all online LRMs to release their watchdogs.
+The CRM watchdog is still active during this phase. Each LRM entry's watchdog
+status changes to `released` as it acknowledges the disarm.
+
+disarmed::
+
+All watchdogs have been released cluster-wide. No automatic fencing,
+failover, or recovery takes place. See
+xref:ha_manager_disarm[Disarming HA for Cluster Maintenance].
+
NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
open for its entire lifetime, even when no HA client is connected. This
prevents other processes from claiming the device and ensures the HA stack can
@@ -1281,6 +1294,120 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
immediate node reboot or even reset.
+[[ha_manager_disarm]]
+Disarming HA for Cluster Maintenance
+-------------------------------------
+
+Certain cluster maintenance tasks, such as reconfiguring the network or the
+cluster communication stack (corosync), can cause temporary quorum loss or
+network partitions. Normally, HA would interpret this as a node failure and
+trigger self-fencing, disrupting services unnecessarily.
+
+The disarm mechanism releases all CRM and LRM watchdogs cluster-wide, allowing
+you to perform such maintenance safely without the risk of nodes being fenced.
+
+IMPORTANT: While disarmed, HA does not protect your services. Failures during
+this period are not automatically recovered. Keep the disarm window as short
+as possible.
+
+.Resource Modes
+
+When disarming HA, you must choose a resource mode that controls how
+HA-managed resources are handled while disarmed. Disarming itself does not
+change the current state of any resource.
+
+freeze::
+
+New commands and state changes are not applied. Services stay in their current
+state, but the HA stack does not react to failures or process new requests.
+This is the safest choice when you expect all nodes to remain running.
+
+ignore::
+
+Resources are removed from HA tracking and can be managed as if they were not
+HA-managed. This allows you to manually start, stop, or migrate services
+while HA is disarmed. Use this mode when you need to relocate services
+during maintenance.
+
+.Disarming and Re-Arming
+
+To disarm HA with the desired resource mode:
+
+----
+# ha-manager crm-command disarm-ha freeze
+----
+
+or:
+
+----
+# ha-manager crm-command disarm-ha ignore
+----
+
+To re-arm HA after maintenance is complete:
+
+----
+# ha-manager crm-command arm-ha
+----
+
+You can monitor the current state with:
+
+----
+# ha-manager status
+----
+
+The fencing status line shows the current state of the fencing mechanism (see
+xref:ha_manager_fencing_status[Fencing & Watchdog Status]), including the CRM
+and LRM watchdog states.
+
+.The Disarm Process
+
+After you request disarm, the following sequence happens:
+
+. The CRM freezes all services or removes them from tracking, depending on
+ the chosen resource mode.
+. Each LRM finishes its active workers, then releases its agent lock and
+ watchdog.
+. Once all online LRMs are idle, the CRM releases its own watchdog too.
+
+The CRM keeps the manager lock throughout this process, so it can accept and
+process the `arm-ha` command to reverse it.
+
+If any services are currently being fenced or recovered, the disarm is
+deferred until fencing completes. This ensures that partially fenced services
+do not end up in an inconsistent state.
+
+.Nodes Offline During Disarm
+
+If a node is offline when HA is disarmed, its LRM cannot process the disarm
+request. The CRM proceeds to the disarmed state once all *online* LRMs have
+completed their part. The offline node does not block this.
+
+When the offline node comes back online while HA is still disarmed, its LRM
+picks up the disarm state and releases its watchdog without attempting any
+service recovery.
+
+When you re-arm HA, any services that were on the offline node are handled
+according to normal HA recovery rules: they are fenced and recovered if the
+node is still unreachable, or restarted on the node if it has come back
+online.
+
+.Interaction with Maintenance Mode
+
+If a node is already in maintenance mode when disarm is requested, the
+maintenance migration continues until all services have been moved away. Once
+no active services and workers remain, the LRM releases its lock and watchdog
+as part of the disarm process.
+
+When HA is re-armed, the maintenance mode state is preserved. The node remains
+in maintenance and services are not moved back until maintenance mode is
+explicitly disabled.
+
+CAUTION: While the HA stack is disarmed, no automatic recovery, failover, or
+fencing takes place. A node failure during this window is not detected or
+handled by HA. Keep the disarm window as short as possible and ensure that the
+cluster is in a healthy state before re-arming.
+
+
[[ha_manager_crs]]
Cluster Resource Scheduling
---------------------------
--
2.47.3
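
The disarm sequence documented in "The Disarm Process" and "Nodes Offline
During Disarm" can be read as a small state machine. The following is only a
toy model of that description, not the ha-manager implementation: the class
names `Lrm` and `Cluster`, the `step()` method, and the "one worker finishes
per step" rule are assumptions made up for illustration. Only the status names
(`armed`, `disarming`, `disarmed`) and the resource modes (`freeze`, `ignore`)
come from the patch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Fencing(Enum):
    # Status names as documented in the patch ("standby" is left out here).
    ARMED = "armed"
    DISARMING = "disarming"
    DISARMED = "disarmed"


@dataclass
class Lrm:
    """Hypothetical stand-in for one node's LRM."""
    online: bool = True
    active_workers: int = 0
    watchdog_held: bool = True

    def step(self) -> None:
        # An offline LRM cannot process the disarm request at all.
        if not self.online or not self.watchdog_held:
            return
        if self.active_workers > 0:
            # Assumption: exactly one active worker finishes per step.
            self.active_workers -= 1
        else:
            # Idle: release the agent lock and watchdog.
            self.watchdog_held = False


@dataclass
class Cluster:
    """Hypothetical stand-in for the CRM's view of the cluster."""
    lrms: Dict[str, Lrm]
    fencing: Fencing = Fencing.ARMED
    crm_watchdog_held: bool = True
    resource_mode: Optional[str] = None

    def disarm(self, mode: str) -> None:
        if mode not in ("freeze", "ignore"):
            raise ValueError(f"unknown resource mode: {mode!r}")
        # Services are frozen or dropped from tracking, depending on the mode.
        self.resource_mode = mode
        self.fencing = Fencing.DISARMING

    def step(self) -> None:
        if self.fencing is not Fencing.DISARMING:
            return
        for lrm in self.lrms.values():
            lrm.step()
        # Only online LRMs are waited for; offline nodes do not block.
        online = [lrm for lrm in self.lrms.values() if lrm.online]
        if all(not lrm.watchdog_held for lrm in online):
            self.crm_watchdog_held = False
            self.fencing = Fencing.DISARMED
```

In this model, calling `disarm("freeze")` and then `step()` repeatedly keeps
the status at `disarming` while any online LRM still has workers, and reaches
`disarmed` without waiting for offline nodes. Re-arming and the deferral
during in-progress fencing are deliberately not modelled.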