* [PATCH docs 0/2] document disarm-ha, arm-ha and watchdog fencing status
@ 2026-03-10 15:47 Thomas Lamprecht
2026-03-10 15:47 ` [PATCH docs 1/2] ha-manager: document fencing & watchdog status Thomas Lamprecht
2026-03-10 15:47 ` [PATCH docs 2/2] ha-manager: document disarming and arming Thomas Lamprecht
0 siblings, 2 replies; 3+ messages in thread
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
This is the documentation part for the recently sent implementation:
https://lore.proxmox.com/all/20260309220128.973793-1-t.lamprecht@proxmox.com/
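
To try the documented workflow end to end, a maintenance window could be
scripted roughly as below. This is only a sketch: it uses the three commands
from the docs (`ha-manager crm-command disarm-ha <mode>`, `ha-manager
crm-command arm-ha`, `ha-manager status`), while the helper names
(`run_ha_manager`, `fencing_state`, `maintenance_window`), the injectable
`run` callable and the poll/timeout values are made up for illustration. It
also assumes that the fencing entry in the status output is a single line
containing the word "fencing" and the state name, which may not match the
real output format.

```python
import re
import subprocess
import time

RESOURCE_MODES = ("freeze", "ignore")
# State names taken from the "Fencing & Watchdog Status" section.
FENCING_STATES = ("armed", "standby", "disarming", "disarmed")


def run_ha_manager(*args):
    """Run the real `ha-manager` CLI and return its standard output."""
    result = subprocess.run(
        ["ha-manager", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def fencing_state(status_output):
    """Return the state named on the fencing line, or None if not found.

    ASSUMPTION: the fencing entry is one line that contains the word
    "fencing" and the state name as separate words.
    """
    for line in status_output.splitlines():
        words = re.findall(r"[a-z]+", line.lower())
        if "fencing" in words:
            for word in words:
                if word in FENCING_STATES:
                    return word
    return None


def maintenance_window(mode, work, run=run_ha_manager, poll=2.0, timeout=300.0):
    """Disarm HA, wait until it is disarmed, run `work`, then re-arm.

    If `work` raises, HA is deliberately left disarmed so that the cluster
    can be checked by hand before re-arming.
    """
    if mode not in RESOURCE_MODES:
        raise ValueError(f"unknown resource mode: {mode!r}")

    run("crm-command", "disarm-ha", mode)

    # Poll the status output until the fencing entry reports `disarmed`.
    deadline = time.monotonic() + timeout
    while fencing_state(run("status")) != "disarmed":
        if time.monotonic() >= deadline:
            raise TimeoutError("HA did not reach the disarmed state in time")
        time.sleep(poll)

    work()
    run("crm-command", "arm-ha")
```

A call such as `maintenance_window("freeze", reconfigure_network)` would then
wrap a hypothetical `reconfigure_network` function. Not re-arming after a
failed `work` is a design choice here, following the advice to make sure the
cluster is healthy before re-arming.
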
Thomas Lamprecht (2):
ha-manager: document fencing & watchdog status
ha-manager: document disarming and arming
ha-manager.adoc | 155 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 155 insertions(+)
--
2.47.3
* [PATCH docs 1/2] ha-manager: document fencing & watchdog status
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
This accompanies the recent changes in the ha-manager's status API
endpoint to also include an explicit fencing/watchdog status.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
ha-manager.adoc | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/ha-manager.adoc b/ha-manager.adoc
index 4c318fb..ee254be 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1003,6 +1003,34 @@ can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.
+[[ha_manager_fencing_status]]
+Fencing & Watchdog Status
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The `ha-manager status` output includes a fencing entry that shows the CRM
+watchdog state. Each LRM entry additionally shows its own watchdog state.
+
+armed::
+
+The CRM is actively managing services and has its watchdog open. Each node's
+LRM also holds a watchdog while it has its agent lock. On quorum loss or
+daemon failure, the respective watchdog triggers a node reset to ensure safe
+failover.
+
+standby::
+
+The HA stack is ready but no CRM is actively running as master, for example
+when no HA resources are configured yet or the cluster just started. The CRM
+watchdog is not open. Fencing automatically transitions to `armed` once a CRM
+takes over as master.
+
+NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
+open for its entire lifetime, even when no HA client is connected. This
+prevents other processes from claiming the device and ensures the HA stack can
+always re-acquire it. In addition, not all hardware watchdog drivers support
+magic close, so closing the device could itself trigger an unintended reset.
+
+
[[ha_manager_start_failure_policy]]
Start Failure Policy
---------------------
--
2.47.3
* [PATCH docs 2/2] ha-manager: document disarming and arming
From: Thomas Lamprecht @ 2026-03-10 15:47 UTC
To: pve-devel
Add a new section to document the new disarm-ha and arm-ha commands
and their interaction with in-progress fencing, offline nodes and
maintenance mode.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
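
For review purposes, the sequence described in "The Disarm Process" and
"Nodes Offline During Disarm" can be pictured with the toy model below. It is
not derived from the implementation: the `Cluster` and `Lrm` classes, their
fields and the single `step()` round are hypothetical simplifications, and
only the state names `armed`, `disarming`, `disarmed` and `released` come
from the documentation.

```python
from dataclasses import dataclass


@dataclass
class Lrm:
    online: bool = True
    active_workers: int = 0
    released: bool = False  # True once the LRM gave up agent lock and watchdog


@dataclass
class Cluster:
    lrms: dict
    fencing: str = "armed"            # CRM fencing state as shown in the status
    pending_mode: str = ""            # resource mode of a requested disarm
    fencing_in_progress: bool = False

    def request_disarm(self, mode):
        if mode not in ("freeze", "ignore"):
            raise ValueError(f"unknown resource mode: {mode!r}")
        self.pending_mode = mode

    def step(self):
        """Simulate one (simplified) round of CRM and LRM work."""
        if not self.pending_mode:
            return
        if self.fencing == "armed":
            if self.fencing_in_progress:
                return  # disarm is deferred until fencing completes
            # Step 1: services are frozen or removed from tracking.
            self.fencing = "disarming"
        # Step 2: online LRMs without active workers release their watchdog.
        # This also covers a node that comes back while HA is disarmed.
        for lrm in self.lrms.values():
            if lrm.online and lrm.active_workers == 0:
                lrm.released = True
        # Step 3: offline nodes do not block the CRM from disarming.
        if all(lrm.released for lrm in self.lrms.values() if lrm.online):
            self.fencing = "disarmed"
```

In this model a busy LRM keeps the cluster in `disarming` until its workers
finish, while an offline node is simply skipped and releases its watchdog on
the first round after it returns.
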
ha-manager.adoc | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
diff --git a/ha-manager.adoc b/ha-manager.adoc
index ee254be..5547f7c 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1024,6 +1024,19 @@ when no HA resources are configured yet or the cluster just started. The CRM
watchdog is not open. Fencing automatically transitions to `armed` once a CRM
takes over as master.
+disarming::
+
+A `disarm-ha` command was issued. The CRM is freezing or removing services
+from tracking and waiting for all LRMs to release their watchdogs. The CRM
+watchdog is still active during this phase. Each LRM entry's watchdog status
+changes to `released` as it acknowledges the disarm.
+
+disarmed::
+
+All watchdogs have been released cluster-wide. No automatic fencing,
+failover, or recovery takes place. See
+xref:ha_manager_disarm[Disarming HA for Cluster Maintenance].
+
NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
open for its entire lifetime, even when no HA client is connected. This
prevents other processes from claiming the device and ensures the HA stack can
@@ -1281,6 +1294,120 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
immediate node reboot or even reset.
+[[ha_manager_disarm]]
+Disarming HA for Cluster Maintenance
+-------------------------------------
+
+Certain cluster maintenance tasks, such as reconfiguring the network or the
+cluster communication stack (corosync), can cause temporary quorum loss or
+network partitions. Normally, HA would interpret this as a node failure and
+trigger self-fencing, disrupting services unnecessarily.
+
+The disarm mechanism releases all CRM and LRM watchdogs cluster-wide, allowing
+you to perform such maintenance safely without the risk of nodes being fenced.
+
+IMPORTANT: While disarmed, HA does not protect your services. Failures during
+this period are not automatically recovered. Keep the disarm window as short
+as possible.
+
+.Resource Modes
+
+When disarming HA, you must choose a resource mode that controls how
+HA-managed resources are handled while disarmed. Disarming itself does not
+change the current state of any resource.
+
+freeze::
+
+New commands and state changes are not applied. Services stay in their current
+state, but the HA stack does not react to failures or process new requests.
+This is the safest choice when you expect all nodes to remain running.
+
+ignore::
+
+Resources are removed from HA tracking and can be managed as if they were not
+HA-managed. This allows you to start, stop, or migrate services by hand while
+HA is disarmed. Use this when you need to relocate services manually during
+maintenance.
+
+.Disarming and Re-Arming
+
+To disarm HA with the desired resource mode:
+
+----
+# ha-manager crm-command disarm-ha freeze
+----
+
+or:
+
+----
+# ha-manager crm-command disarm-ha ignore
+----
+
+To re-arm HA after maintenance is complete:
+
+----
+# ha-manager crm-command arm-ha
+----
+
+You can monitor the current state with:
+
+----
+# ha-manager status
+----
+
+The fencing status line shows the current state of the fencing mechanism,
+including the CRM and LRM watchdog states (see
+xref:ha_manager_fencing_status[Fencing & Watchdog Status]).
+
+.The Disarm Process
+
+After you request a disarm, the following steps happen in order:
+
+. The CRM freezes all services or removes them from tracking, depending on
+ the chosen resource mode.
+. Each LRM finishes its active workers, then releases its agent lock and
+ watchdog.
+. Once all online LRMs are idle, the CRM releases its own watchdog too.
+
+The CRM keeps the manager lock throughout this process, so it can accept and
+process the `arm-ha` command to reverse it.
+
+If any services are currently being fenced or recovered, the disarm is
+deferred until fencing completes. This ensures that partially fenced services
+do not end up in an inconsistent state.
+
+.Nodes Offline During Disarm
+
+If a node is offline when HA is disarmed, its LRM cannot process the disarm
+request. The CRM proceeds to the disarmed state once all *online* LRMs have
+completed their part. The offline node does not block this.
+
+When the offline node comes back online while HA is still disarmed, its LRM
+picks up the disarm state and releases its watchdog without attempting any
+service recovery.
+
+When you re-arm HA, any services that were on the offline node are handled
+according to normal HA recovery rules: they are fenced and recovered if the
+node is still unreachable, or restarted on the node if it has come back
+online.
+
+.Interaction with Maintenance Mode
+
+If a node is already in maintenance mode when disarm is requested, the
+maintenance migration continues until all services have been moved away. Once
+no active services and workers remain, the LRM releases its lock and watchdog
+as part of the disarm process.
+
+When HA is re-armed, the maintenance mode state is preserved. The node remains
+in maintenance and services are not moved back until maintenance mode is
+explicitly disabled.
+
+CAUTION: While the HA stack is disarmed, no automatic recovery, failover, or
+fencing takes place. A node failure during this window is not detected or
+handled by HA. Keep the disarm window as short as possible and ensure that the
+cluster is in a healthy state before re-arming.
+
+
[[ha_manager_crs]]
Cluster Resource Scheduling
---------------------------
--
2.47.3