From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 5BBB61FF144
	for <inbox@lore.proxmox.com>; Tue, 10 Mar 2026 16:52:36 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id E68E934F6;
	Tue, 10 Mar 2026 16:52:28 +0100 (CET)
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH docs 2/2] ha-manager: document disarming and arming
Date: Tue, 10 Mar 2026 16:47:30 +0100
Message-ID: <20260310155216.2086316-3-t.lamprecht@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260310155216.2086316-1-t.lamprecht@proxmox.com>
References: <20260310155216.2086316-1-t.lamprecht@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1773157909836
Message-ID-Hash: GDST3AINBCQQZMBEZVZ6GZP267OR63XH
X-Message-ID-Hash: GDST3AINBCQQZMBEZVZ6GZP267OR63XH
X-MailFrom: t.lamprecht@proxmox.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; emergency; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pve-devel-owner@lists.proxmox.com>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Subscribe: <mailto:pve-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pve-devel-leave@lists.proxmox.com>

Add a new section documenting the new disarm-ha and arm-ha CRM commands
and how they interact with in-progress fencing, offline nodes, and
maintenance mode. Also describe the new `disarming` and `disarmed`
fencing states.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
 ha-manager.adoc | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index ee254be..5547f7c 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1024,6 +1024,19 @@ when no HA resources are configured yet or the cluster just started. The CRM
 watchdog is not open. Fencing automatically transitions to `armed` once a CRM
 takes over as master.
 
+disarming::
+
+A `disarm-ha` command was issued. The CRM is freezing or removing services
+from tracking and waiting for all LRMs to release their watchdogs. The CRM
+watchdog is still active during this phase. Each LRM entry's watchdog status
+changes to `released` as it acknowledges the disarm.
+
+disarmed::
+
+All watchdogs have been released cluster-wide. No automatic fencing,
+failover, or recovery takes place. See
+xref:ha_manager_disarm[Disarming HA for Cluster Maintenance].
+
 NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
 open for its entire lifetime, even when no HA client is connected. This
 prevents other processes from claiming the device and ensures the HA stack can
@@ -1281,6 +1294,120 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
 immediate node reboot or even reset.
 
 
+[[ha_manager_disarm]]
+Disarming HA for Cluster Maintenance
+-------------------------------------
+
+Certain cluster maintenance tasks, such as reconfiguring the network or the
+cluster communication stack (corosync), can cause temporary quorum loss or
+network partitions. Normally, HA would interpret this as a node failure and
+trigger self-fencing, disrupting services unnecessarily.
+
+The disarm mechanism releases all CRM and LRM watchdogs cluster-wide, allowing
+you to perform such maintenance safely without the risk of nodes being fenced.
+
+IMPORTANT: While disarmed, HA does not protect your services. Failures during
+this period are not automatically recovered. Keep the disarm window as short
+as possible.
+
+.Resource Modes
+
+When disarming HA, you must choose a resource mode that controls how
+HA-managed resources are handled while disarmed. The current state of
+resources is not affected.
+
+freeze::
+
+New commands and state changes are not applied. Services stay in their current
+state, but the HA stack does not react to failures or process new requests.
+This is the safest choice when you expect all nodes to remain running.
+
+ignore::
+
+Resources are removed from HA tracking and can be managed as if they were not
+HA managed. This allows you to manually start, stop, or migrate services
+while HA is disarmed. Use this when you need to manually relocate services
+during maintenance.
+
+.Disarming and Re-Arming
+
+To disarm HA with the desired resource mode:
+
+----
+# ha-manager crm-command disarm-ha freeze
+----
+
+or:
+
+----
+# ha-manager crm-command disarm-ha ignore
+----
+
+To re-arm HA after maintenance is complete:
+
+----
+# ha-manager crm-command arm-ha
+----
+
+You can monitor the current state with:
+
+----
+# ha-manager status
+----
+
+The fencing status line shows the current state of the fencing mechanism (see
+xref:ha_manager_fencing_status[Fencing Status]), including the CRM and LRM
+watchdog states.
+
+.The Disarm Process
+
+After you request disarm, the following sequence happens:
+
+. The CRM freezes all services or removes them from tracking, depending on
+  the chosen resource mode.
+. Each LRM finishes its active workers, then releases its agent lock and
+  watchdog.
+. Once all online LRMs have done so, the CRM releases its own watchdog too.
+
+The CRM keeps the manager lock throughout this process, so it can accept and
+process the `arm-ha` command to reverse it.
+
+If any services are currently being fenced or recovered, the disarm is
+deferred until fencing completes. This ensures that partially fenced services
+do not end up in an inconsistent state.
+
+.Nodes Offline During Disarm
+
+If a node is offline when HA is disarmed, its LRM cannot process the disarm
+request. The CRM proceeds to the disarmed state once all *online* LRMs have
+completed their part. The offline node does not block this.
+
+When the offline node comes back online while HA is still disarmed, its LRM
+picks up the disarm state and releases its watchdog without attempting any
+service recovery.
+
+When you re-arm HA, any services that were on the offline node are handled
+according to normal HA recovery rules: they are fenced and recovered if the
+node is still unreachable, or restarted on the node if it has come back
+online.
+
+.Interaction with Maintenance Mode
+
+If a node is already in maintenance mode when disarm is requested, the
+maintenance migration continues until all services have been moved away. Once
+no active services and workers remain, the LRM releases its lock and watchdog
+as part of the disarm process.
+
+When HA is re-armed, the maintenance mode state is preserved. The node remains
+in maintenance and services are not moved back until maintenance mode is
+explicitly disabled.
+
+CAUTION: While the HA stack is disarmed, no automatic recovery, failover, or
+fencing takes place. A node failure during this window is not detected or
+handled by HA. Keep the disarm window as short as possible and ensure that the
+cluster is in a healthy state before re-arming.
+
+
 [[ha_manager_crs]]
 Cluster Resource Scheduling
 ---------------------------
-- 
2.47.3