From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Lamprecht
To: pve-devel@lists.proxmox.com
Subject: [PATCH docs 2/2] ha-manager: document disarming and arming
Date: Tue, 10 Mar 2026 16:47:30 +0100
Message-ID: <20260310155216.2086316-3-t.lamprecht@proxmox.com>
In-Reply-To: <20260310155216.2086316-1-t.lamprecht@proxmox.com>
References: <20260310155216.2086316-1-t.lamprecht@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Proxmox VE development discussion

Add a new section to document the new disarm-ha and arm-ha commands and
their interaction with some other commands or situations.

Signed-off-by: Thomas Lamprecht
---
 ha-manager.adoc | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index ee254be..5547f7c 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -1024,6 +1024,19 @@ when no HA resources are configured yet or the cluster just started. The CRM
 watchdog is not open. Fencing automatically transitions to `armed` once a CRM
 takes over as master.
 
+disarming::
+
+A `disarm-ha` command was issued. The CRM is freezing or removing services
+from tracking and waiting for all LRMs to release their watchdogs. The CRM
+watchdog is still active during this phase. Each LRM entry's watchdog status
+changes to `released` as it acknowledges the disarm.
+
+disarmed::
+
+All watchdogs have been released cluster-wide. No automatic fencing,
+failover, or recovery takes place. See
+xref:ha_manager_disarm[Disarming HA for Cluster Maintenance].
+
 NOTE: The `watchdog-mux` service keeps the underlying `/dev/watchdog` device
 open for its entire lifetime, even when no HA client is connected.
 This prevents other processes from claiming the device and ensures the HA stack can
@@ -1281,6 +1294,120 @@ NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
 immediate node reboot or even reset.
 
 
+[[ha_manager_disarm]]
+Disarming HA for Cluster Maintenance
+-------------------------------------
+
+Certain cluster maintenance tasks, such as reconfiguring the network or the
+cluster communication stack (corosync), can cause temporary quorum loss or
+network partitions. Normally, HA would interpret this as a node failure and
+trigger self-fencing, disrupting services unnecessarily.
+
+The disarm mechanism releases all CRM and LRM watchdogs cluster-wide, allowing
+you to perform such maintenance safely without the risk of nodes being fenced.
+
+IMPORTANT: While disarmed, HA does not protect your services. Failures during
+this period are not automatically recovered. Keep the disarm window as short
+as possible.
+
+.Resource Modes
+
+When disarming HA, you must choose a resource mode that controls how HA
+managed resources are handled while disarmed. The current state of resources
+is not affected.
+
+freeze::
+
+New commands and state changes are not applied. Services stay in their current
+state, but the HA stack does not react to failures or process new requests.
+This is the safest choice when you expect all nodes to remain running.
+
+ignore::
+
+Resources are removed from HA tracking and can be managed as if they were not
+HA managed. This allows you to manually start, stop, or migrate services
+while HA is disarmed. Use this when you need to manually relocate services
+during maintenance.
+
+.Disarming and Re-Arming
+
+To disarm HA with the desired resource mode:
+
+----
+# ha-manager crm-command disarm-ha freeze
+----
+
+or:
+
+----
+# ha-manager crm-command disarm-ha ignore
+----
+
+To re-arm HA after maintenance is complete:
+
+----
+# ha-manager crm-command arm-ha
+----
+
+You can monitor the current state with:
+
+----
+# ha-manager status
+----
+
+The fencing status line shows the current state of the fencing mechanism (see
+xref:ha_manager_fencing_status[Fencing Status]), including the CRM and LRM
+watchdog states.
+
+.The Disarm Process
+
+After you request disarm, the following sequence happens:
+
+. The CRM freezes all services or removes them from tracking, depending on
+  the chosen resource mode.
+. Each LRM finishes its active workers, then releases its agent lock and
+  watchdog.
+. Once all online LRMs are idle, the CRM releases its own watchdog too.
+
+The CRM keeps the manager lock throughout this process, so it can accept and
+process the `arm-ha` command to reverse it.
+
+If any services are currently being fenced or recovered, the disarm is
+deferred until fencing completes. This ensures that partially fenced services
+do not end up in an inconsistent state.
+
+.Nodes Offline During Disarm
+
+If a node is offline when HA is disarmed, its LRM cannot process the disarm
+request. The CRM proceeds to the disarmed state once all *online* LRMs have
+completed their part. The offline node does not block this.
+
+When the offline node comes back online while HA is still disarmed, its LRM
+picks up the disarm state and releases its watchdog without attempting any
+service recovery.
+
+When you re-arm HA, any services that were on the offline node are handled
+according to normal HA recovery rules: they are fenced and recovered if the
+node is still unreachable, or restarted on the node if it has come back
+online.
+
+.Interaction with Maintenance Mode
+
+If a node is already in maintenance mode when disarm is requested, the
+maintenance migration continues until all services have been moved away. Once
+no active services and workers remain, the LRM releases its lock and watchdog
+as part of the disarm process.
+
+When HA is re-armed, the maintenance mode state is preserved. The node remains
+in maintenance and services are not moved back until maintenance mode is
+explicitly disabled.
+
+CAUTION: While the HA stack is disarmed, no automatic recovery, failover, or
+fencing takes place. A node failure during this window is not detected or
+handled by HA. Keep the disarm window as short as possible and ensure that the
+cluster is in a healthy state before re-arming.
+
+
 [[ha_manager_crs]]
 Cluster Resource Scheduling
 ---------------------------
-- 
2.47.3
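
For readers who want to script the documented workflow, the following is a
minimal sketch and not part of the patch. It wraps one maintenance command
between the `disarm-ha` and `arm-ha` requests. Only the three `ha-manager`
invocations are taken from the documentation above; the function name
`with_ha_disarmed` and the `HA_MANAGER` override variable are hypothetical
names introduced for illustration. The sketch does not wait for the fencing
state to reach `disarmed`, because the exact `ha-manager status` output format
is not described here, so it only prints the status for the operator to check.

```shell
#!/bin/sh
# Sketch only: wrap a maintenance command in disarm-ha / arm-ha.
# HA_MANAGER is a hypothetical override, useful for dry runs and tests.
HA_MANAGER="${HA_MANAGER:-ha-manager}"

# Usage: with_ha_disarmed MODE CMD [ARGS...]
# MODE is one of the two resource modes from the patch: freeze or ignore.
with_ha_disarmed() {
    mode="$1"
    case "$mode" in
        freeze|ignore) ;;
        *) echo "unknown resource mode: $mode" >&2; return 2 ;;
    esac
    shift
    if [ "$#" -lt 1 ]; then
        echo "no maintenance command given" >&2
        return 2
    fi

    # Request the disarm; abort if the request itself fails.
    "$HA_MANAGER" crm-command disarm-ha "$mode" || return 1

    # Show the current state. A real script would have to wait here until
    # the fencing state is reported as disarmed; that is not done here.
    "$HA_MANAGER" status

    # Run the maintenance command and remember its exit status.
    "$@"
    rc=$?

    # Always request re-arming so the disarm window stays short.
    "$HA_MANAGER" crm-command arm-ha || return 1
    return "$rc"
}
```

A call such as `with_ha_disarmed freeze systemctl restart corosync` would then
request re-arming even if the maintenance command fails, and return that
command's exit status. Checking that the cluster is healthy before re-arming,
as the CAUTION block asks, is left to the operator.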