public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance
@ 2026-03-21 23:42 Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-21 23:42 UTC (permalink / raw)
  To: pve-devel

The biggest change compared to v1 is how ignore mode handles the service
status: instead of clearing it entirely, the relevant parts of service
status are now preserved across the disarm/arm cycle. This allows
runtime state like maintenance_node to survive, so services correctly
migrate back to their original node after maintenance ends, even if the
disarm happened while maintenance was active. Thanks to Dominik R. for
noticing this.

To keep the preserved state clean, stale runtime data (failed_nodes,
cmd, target, ...) is pruned from service entries on disarm - both in
freeze and ignore mode - so the state machine starts fresh on re-arm.
The status API overrides the displayed service state to 'ignore' during
disarm-ignore mode, while the internal state stays untouched for
seamless resume.
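
To illustrate the pruning idea, here is a minimal standalone sketch —
the helper name and the exact key list are made up for illustration,
not the actual Manager.pm code:

```perl
use strict;
use warnings;

# Illustrative only: drop volatile runtime keys from a preserved
# service status entry on disarm, keeping durable state like 'node'
# and 'maintenance_node' so it survives the disarm/arm cycle.
sub prune_service_runtime_state {
    my ($sd) = @_;
    delete $sd->{$_} for qw(failed_nodes cmd target);
    return $sd;
}

my $sd = {
    node             => 'node2',
    maintenance_node => 'node1',
    failed_nodes     => ['node3'],
    cmd              => 'migrate',
    target           => 'node3',
};
prune_service_runtime_state($sd);

# durable keys survive, volatile ones are gone
print join(',', sort keys %$sd), "\n";    # maintenance_node,node
```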

On arm-ha from ignore mode, the CRM now rechecks each resource's
previously recorded node against the service config, picking up any
manual migrations the admin performed while HA tracking was suspended.
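
That recheck boils down to something like the following (a
hypothetical sketch; function and variable names are illustrative,
not the actual Manager.pm code):

```perl
use strict;
use warnings;

# On arm-ha, prefer the node recorded in the service config over the
# stale preserved status: if the admin migrated the VM while HA
# tracking was suspended, the config reflects the new location.
sub recheck_service_node {
    my ($preserved, $conf) = @_;
    if (defined($conf->{node}) && $conf->{node} ne $preserved->{node}) {
        $preserved->{node} = $conf->{node};    # pick up manual migration
    }
    return $preserved;
}

my $preserved = { node => 'node3', state => 'started' };
my $conf      = { node => 'node1' };    # admin migrated it meanwhile
recheck_service_node($preserved, $conf);

print $preserved->{node}, "\n";    # node1
```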

Patch 1/4 is new: it adds a manual-migrate simulator command as a
preparatory change, since it is independently useful for testing the
per-service 'ignored' state handling.

Previous discussion and v1:
https://lore.proxmox.com/pve-devel/20260309220128.973793-1-t.lamprecht@proxmox.com/

TBD:
- some more in-depth (real-world) testing
- UI integration
- maybe some more polishing

changes v1 -> v2:
- ignore mode: preserve relevant service status instead of clearing it,
  recheck node info on arm-ha for manual migrations [Dominik]
- prune stale runtime data from service entries on disarm for both modes
- add 'protected => 1' to both API endpoints [Dominik]
- split out manual-migrate sim command as preparatory patch
- various style, log level, and test improvements (see per-patch
  changelogs for details)

Thomas Lamprecht (4):
  sim: hardware: add manual-migrate command for ignored services
  api: status: add fencing status entry with armed/standby state
  fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  api: status: add disarm-ha and arm-ha endpoints and CLI wiring

 src/PVE/API2/HA/Status.pm                     | 143 ++++++++++++-
 src/PVE/CLI/ha_manager.pm                     |   2 +
 src/PVE/HA/CRM.pm                             |  33 ++-
 src/PVE/HA/Config.pm                          |   5 +
 src/PVE/HA/LRM.pm                             |  31 ++-
 src/PVE/HA/Manager.pm                         | 197 ++++++++++++++++--
 src/PVE/HA/Sim/Hardware.pm                    |  36 ++++
 src/test/test-disarm-crm-stop1/README         |  13 ++
 src/test/test-disarm-crm-stop1/cmdlist        |   6 +
 .../test-disarm-crm-stop1/hardware_status     |   5 +
 src/test/test-disarm-crm-stop1/log.expect     |  66 ++++++
 src/test/test-disarm-crm-stop1/manager_status |   1 +
 src/test/test-disarm-crm-stop1/service_config |   5 +
 src/test/test-disarm-double1/cmdlist          |   7 +
 src/test/test-disarm-double1/hardware_status  |   5 +
 src/test/test-disarm-double1/log.expect       |  53 +++++
 src/test/test-disarm-double1/manager_status   |   1 +
 src/test/test-disarm-double1/service_config   |   4 +
 src/test/test-disarm-failing-service1/cmdlist |   6 +
 .../hardware_status                           |   5 +
 .../test-disarm-failing-service1/log.expect   | 125 +++++++++++
 .../manager_status                            |   1 +
 .../service_config                            |   4 +
 src/test/test-disarm-fence1/cmdlist           |   9 +
 src/test/test-disarm-fence1/hardware_status   |   5 +
 src/test/test-disarm-fence1/log.expect        |  78 +++++++
 src/test/test-disarm-fence1/manager_status    |   1 +
 src/test/test-disarm-fence1/service_config    |   5 +
 src/test/test-disarm-frozen1/README           |  10 +
 src/test/test-disarm-frozen1/cmdlist          |   5 +
 src/test/test-disarm-frozen1/hardware_status  |   5 +
 src/test/test-disarm-frozen1/log.expect       |  59 ++++++
 src/test/test-disarm-frozen1/manager_status   |   1 +
 src/test/test-disarm-frozen1/service_config   |   5 +
 src/test/test-disarm-ignored1/README          |  10 +
 src/test/test-disarm-ignored1/cmdlist         |   5 +
 src/test/test-disarm-ignored1/hardware_status |   5 +
 src/test/test-disarm-ignored1/log.expect      |  50 +++++
 src/test/test-disarm-ignored1/manager_status  |   1 +
 src/test/test-disarm-ignored1/service_config  |   5 +
 src/test/test-disarm-ignored2/cmdlist         |   6 +
 src/test/test-disarm-ignored2/hardware_status |   5 +
 src/test/test-disarm-ignored2/log.expect      |  60 ++++++
 src/test/test-disarm-ignored2/manager_status  |   1 +
 src/test/test-disarm-ignored2/service_config  |   5 +
 src/test/test-disarm-maintenance1/cmdlist     |   7 +
 .../test-disarm-maintenance1/hardware_status  |   5 +
 src/test/test-disarm-maintenance1/log.expect  |  79 +++++++
 .../test-disarm-maintenance1/manager_status   |   1 +
 .../test-disarm-maintenance1/service_config   |   5 +
 src/test/test-disarm-maintenance2/cmdlist     |   7 +
 .../test-disarm-maintenance2/hardware_status  |   5 +
 src/test/test-disarm-maintenance2/log.expect  |  78 +++++++
 .../test-disarm-maintenance2/manager_status   |   1 +
 .../test-disarm-maintenance2/service_config   |   5 +
 src/test/test-disarm-maintenance3/cmdlist     |   8 +
 .../test-disarm-maintenance3/hardware_status  |   5 +
 src/test/test-disarm-maintenance3/log.expect  |  80 +++++++
 .../test-disarm-maintenance3/manager_status   |   1 +
 .../test-disarm-maintenance3/service_config   |   5 +
 src/test/test-disarm-relocate1/README         |   3 +
 src/test/test-disarm-relocate1/cmdlist        |   7 +
 .../test-disarm-relocate1/hardware_status     |   5 +
 src/test/test-disarm-relocate1/log.expect     |  51 +++++
 src/test/test-disarm-relocate1/manager_status |   1 +
 src/test/test-disarm-relocate1/service_config |   4 +
 src/test/test-manual-migrate-ignored1/cmdlist |   7 +
 .../hardware_status                           |   5 +
 .../test-manual-migrate-ignored1/log.expect   |  44 ++++
 .../manager_status                            |   1 +
 .../service_config                            |   5 +
 71 files changed, 1481 insertions(+), 34 deletions(-)
 create mode 100644 src/test/test-disarm-crm-stop1/README
 create mode 100644 src/test/test-disarm-crm-stop1/cmdlist
 create mode 100644 src/test/test-disarm-crm-stop1/hardware_status
 create mode 100644 src/test/test-disarm-crm-stop1/log.expect
 create mode 100644 src/test/test-disarm-crm-stop1/manager_status
 create mode 100644 src/test/test-disarm-crm-stop1/service_config
 create mode 100644 src/test/test-disarm-double1/cmdlist
 create mode 100644 src/test/test-disarm-double1/hardware_status
 create mode 100644 src/test/test-disarm-double1/log.expect
 create mode 100644 src/test/test-disarm-double1/manager_status
 create mode 100644 src/test/test-disarm-double1/service_config
 create mode 100644 src/test/test-disarm-failing-service1/cmdlist
 create mode 100644 src/test/test-disarm-failing-service1/hardware_status
 create mode 100644 src/test/test-disarm-failing-service1/log.expect
 create mode 100644 src/test/test-disarm-failing-service1/manager_status
 create mode 100644 src/test/test-disarm-failing-service1/service_config
 create mode 100644 src/test/test-disarm-fence1/cmdlist
 create mode 100644 src/test/test-disarm-fence1/hardware_status
 create mode 100644 src/test/test-disarm-fence1/log.expect
 create mode 100644 src/test/test-disarm-fence1/manager_status
 create mode 100644 src/test/test-disarm-fence1/service_config
 create mode 100644 src/test/test-disarm-frozen1/README
 create mode 100644 src/test/test-disarm-frozen1/cmdlist
 create mode 100644 src/test/test-disarm-frozen1/hardware_status
 create mode 100644 src/test/test-disarm-frozen1/log.expect
 create mode 100644 src/test/test-disarm-frozen1/manager_status
 create mode 100644 src/test/test-disarm-frozen1/service_config
 create mode 100644 src/test/test-disarm-ignored1/README
 create mode 100644 src/test/test-disarm-ignored1/cmdlist
 create mode 100644 src/test/test-disarm-ignored1/hardware_status
 create mode 100644 src/test/test-disarm-ignored1/log.expect
 create mode 100644 src/test/test-disarm-ignored1/manager_status
 create mode 100644 src/test/test-disarm-ignored1/service_config
 create mode 100644 src/test/test-disarm-ignored2/cmdlist
 create mode 100644 src/test/test-disarm-ignored2/hardware_status
 create mode 100644 src/test/test-disarm-ignored2/log.expect
 create mode 100644 src/test/test-disarm-ignored2/manager_status
 create mode 100644 src/test/test-disarm-ignored2/service_config
 create mode 100644 src/test/test-disarm-maintenance1/cmdlist
 create mode 100644 src/test/test-disarm-maintenance1/hardware_status
 create mode 100644 src/test/test-disarm-maintenance1/log.expect
 create mode 100644 src/test/test-disarm-maintenance1/manager_status
 create mode 100644 src/test/test-disarm-maintenance1/service_config
 create mode 100644 src/test/test-disarm-maintenance2/cmdlist
 create mode 100644 src/test/test-disarm-maintenance2/hardware_status
 create mode 100644 src/test/test-disarm-maintenance2/log.expect
 create mode 100644 src/test/test-disarm-maintenance2/manager_status
 create mode 100644 src/test/test-disarm-maintenance2/service_config
 create mode 100644 src/test/test-disarm-maintenance3/cmdlist
 create mode 100644 src/test/test-disarm-maintenance3/hardware_status
 create mode 100644 src/test/test-disarm-maintenance3/log.expect
 create mode 100644 src/test/test-disarm-maintenance3/manager_status
 create mode 100644 src/test/test-disarm-maintenance3/service_config
 create mode 100644 src/test/test-disarm-relocate1/README
 create mode 100644 src/test/test-disarm-relocate1/cmdlist
 create mode 100644 src/test/test-disarm-relocate1/hardware_status
 create mode 100644 src/test/test-disarm-relocate1/log.expect
 create mode 100644 src/test/test-disarm-relocate1/manager_status
 create mode 100644 src/test/test-disarm-relocate1/service_config
 create mode 100644 src/test/test-manual-migrate-ignored1/cmdlist
 create mode 100644 src/test/test-manual-migrate-ignored1/hardware_status
 create mode 100644 src/test/test-manual-migrate-ignored1/log.expect
 create mode 100644 src/test/test-manual-migrate-ignored1/manager_status
 create mode 100644 src/test/test-manual-migrate-ignored1/service_config

-- 
2.47.3
^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
@ 2026-03-21 23:42 ` Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-21 23:42 UTC (permalink / raw)
  To: pve-devel

Add a 'manual-migrate' action to the simulator's service command
handler, allowing tests to simulate an admin migrating a VM outside
of HA control.

The command is guarded to only work when the service is in the
'ignored' request state, mirroring the real-world constraint that only
services not actively managed by HA can be manually migrated.

It uses the same simulator method as the "stealing" performed on
recovery of fenced services.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

New in v2.

 src/PVE/HA/Sim/Hardware.pm                    | 20 +++++++++
 src/test/test-manual-migrate-ignored1/cmdlist |  7 +++
 .../hardware_status                           |  5 +++
 .../test-manual-migrate-ignored1/log.expect   | 44 +++++++++++++++++++
 .../manager_status                            |  1 +
 .../service_config                            |  5 +++
 6 files changed, 82 insertions(+)
 create mode 100644 src/test/test-manual-migrate-ignored1/cmdlist
 create mode 100644 src/test/test-manual-migrate-ignored1/hardware_status
 create mode 100644 src/test/test-manual-migrate-ignored1/log.expect
 create mode 100644 src/test/test-manual-migrate-ignored1/manager_status
 create mode 100644 src/test/test-manual-migrate-ignored1/service_config

diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
index 8cbf48d..301c391 100644
--- a/src/PVE/HA/Sim/Hardware.pm
+++ b/src/PVE/HA/Sim/Hardware.pm
@@ -879,6 +879,26 @@ sub sim_hardware_cmd {
                     { maxcpu => $params[0], maxmem => $params[1] },
                 );
 
+            } elsif ($action eq 'manual-migrate') {
+
+                die "sim_hardware_cmd: missing target node for '$action' command"
+                    if !$param;
+
+                my $conf = $self->read_service_config();
+
+                die "sim_hardware_cmd: service '$sid' not configured\n"
+                    if !$conf->{$sid};
+
+                my $current_node = $conf->{$sid}->{node}
+                    || die "sim_hardware_cmd: service '$sid' has no node\n";
+
+                die "sim_hardware_cmd: manual-migrate requires service"
+                    . " in 'ignored' state\n"
+                    if !defined($conf->{$sid}->{state})
+                    || $conf->{$sid}->{state} ne 'ignored';
+
+                $self->change_service_location($sid, $current_node, $param);
+
             } elsif ($action eq 'delete') {
 
                 $self->delete_service($sid);
diff --git a/src/test/test-manual-migrate-ignored1/cmdlist b/src/test/test-manual-migrate-ignored1/cmdlist
new file mode 100644
index 0000000..a791b3a
--- /dev/null
+++ b/src/test/test-manual-migrate-ignored1/cmdlist
@@ -0,0 +1,7 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "service vm:103 ignored" ],
+    [ "service vm:103 manual-migrate node1" ],
+    [ "service vm:103 started" ],
+    []
+]
diff --git a/src/test/test-manual-migrate-ignored1/hardware_status b/src/test/test-manual-migrate-ignored1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-manual-migrate-ignored1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-manual-migrate-ignored1/log.expect b/src/test/test-manual-migrate-ignored1/log.expect
new file mode 100644
index 0000000..0060d76
--- /dev/null
+++ b/src/test/test-manual-migrate-ignored1/log.expect
@@ -0,0 +1,44 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node3)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute service vm:103 ignored
+info    120    node1/crm: removing stale service 'vm:103' (ignored state requested)
+info    220      cmdlist: execute service vm:103 manual-migrate node1
+info    320      cmdlist: execute service vm:103 started
+info    320    node1/crm: adding new service 'vm:103' on node 'node1'
+info    320    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node1)
+info    321    node1/lrm: starting service vm:103
+info    321    node1/lrm: service status vm:103 started
+info    920     hardware: exit simulation - done
diff --git a/src/test/test-manual-migrate-ignored1/manager_status b/src/test/test-manual-migrate-ignored1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-manual-migrate-ignored1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-manual-migrate-ignored1/service_config b/src/test/test-manual-migrate-ignored1/service_config
new file mode 100644
index 0000000..4b26f6b
--- /dev/null
+++ b/src/test/test-manual-migrate-ignored1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
-- 
2.47.3

* [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
@ 2026-03-21 23:42 ` Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-21 23:42 UTC (permalink / raw)
  To: pve-devel

Add a fencing entry to the HA status output that shows whether the
fencing mechanism is active or idle. The CRM only opens the watchdog
when actively running as master, so distinguish between:

- armed: CRM is active master, watchdog connected
- standby: no active CRM master (e.g. no services configured, cluster
  just started), watchdog not open

Each LRM entry additionally shows its per-node watchdog state. The LRM
holds its watchdog while it has the agent lock (active or maintenance
state).

Previously there was no indication of the fencing state at all, which
made it hard to tell whether the watchdog was actually protecting the
cluster.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

No changes since v1.

 src/PVE/API2/HA/Status.pm | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
index a1e5787..a6b00b9 100644
--- a/src/PVE/API2/HA/Status.pm
+++ b/src/PVE/API2/HA/Status.pm
@@ -91,7 +91,7 @@ __PACKAGE__->register_method({
                 },
                 type => {
                     description => "Type of status entry.",
-                    enum => ["quorum", "master", "lrm", "service"],
+                    enum => ["quorum", "master", "lrm", "service", "fencing"],
                 },
                 quorate => {
                     description => "For type 'quorum'. Whether the cluster is quorate or not.",
@@ -143,6 +143,13 @@ __PACKAGE__->register_method({
                     type => "string",
                     optional => 1,
                 },
+                'armed-state' => {
+                    description => "For type 'fencing'. Whether HA fencing is armed"
+                        . " or on standby.",
+                    type => "string",
+                    enum => ['armed', 'standby'],
+                    optional => 1,
+                },
             },
         },
     },
@@ -193,6 +200,23 @@ __PACKAGE__->register_method({
                 };
         }
 
+        # the CRM only opens the watchdog when actively running as master
+        my $crm_active =
+            defined($status->{master_node})
+            && defined($status->{timestamp})
+            && $timestamp_to_status->($ctime, $status->{timestamp}) eq 'active';
+
+        my $armed_state = $crm_active ? 'armed' : 'standby';
+        my $crm_wd = $crm_active ? "CRM watchdog active" : "CRM watchdog standby";
+        push @$res,
+            {
+                id => 'fencing',
+                type => 'fencing',
+                node => $status->{master_node} // $nodename,
+                status => "$armed_state ($crm_wd)",
+                'armed-state' => $armed_state,
+            };
+
         foreach my $node (sort keys %{ $status->{node_status} }) {
             my $active_count =
                 PVE::HA::Tools::count_active_services($status->{service_status}, $node);
@@ -209,10 +233,17 @@ __PACKAGE__->register_method({
             } else {
                 my $status_str = &$timestamp_to_status($ctime, $lrm_status->{timestamp});
                 my $lrm_mode = $lrm_status->{mode};
+                my $lrm_state = $lrm_status->{state} || 'unknown';
+
+                # LRM holds its watchdog while it has the agent lock
+                my $lrm_wd =
+                    ($status_str eq 'active'
+                        && ($lrm_state eq 'active' || $lrm_state eq 'maintenance'))
+                    ? 'watchdog active'
+                    : 'watchdog standby';
 
                 if ($status_str eq 'active') {
                     $lrm_mode ||= 'active';
-                    my $lrm_state = $lrm_status->{state} || 'unknown';
                     if ($lrm_mode ne 'active') {
                         $status_str = "$lrm_mode mode";
                     } else {
@@ -227,7 +258,7 @@ __PACKAGE__->register_method({
                 }
 
                 my $time_str = localtime($lrm_status->{timestamp});
-                my $status_text = "$node ($status_str, $time_str)";
+                my $status_text = "$node ($status_str, $lrm_wd, $time_str)";
                 push @$res,
                     {
                         id => $id,
-- 
2.47.3

* [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
  2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
@ 2026-03-21 23:42 ` Thomas Lamprecht
  2026-03-23 13:04   ` Dominik Rusovac
                     ` (2 more replies)
  2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-21 23:42 UTC (permalink / raw)
  To: pve-devel

Certain cluster maintenance tasks, such as reconfiguring the network
or the cluster communication stack, can cause temporary quorum loss or
network partitions. Normally, HA would trigger self-fencing in such
situations, disrupting services unnecessarily.

Add a disarm/arm mechanism that releases all watchdogs cluster-wide,
allowing such work to be done safely.

A 'resource-mode' parameter controls how HA-managed resources are
handled while disarmed (the current state of resources is not
affected):
- 'freeze': new commands and state changes are not applied, just like
  what's done automatically when restarting an LRM.
- 'ignore': resources are removed from HA tracking and can be managed
  as if they were not HA managed.

After disarm is requested, the CRM freezes or removes services and
waits for each LRM to finish active workers and release its agent
lock and watchdog. Once all LRMs are idle, the CRM releases its own
watchdog too. The CRM keeps the manager lock throughout so it can
process arm-ha to reverse the process.

Disarm is deferred while any services are being fenced or recovered.
The disarm state is preserved across CRM master changes. Maintenance
mode takes priority over disarm in the LRM.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

changes v1 -> v2:
  - ignore mode: preserve relevant parts of service status instead of
    fully clearing it, so runtime state like maintenance_node survives;
    recheck node info from the service config on arm-ha to pick up
    manual migrations done during disarm. The same applies to freeze.
  - wrap add/remove service loops in single disarm guard [Dominik]
  - validate resource mode in simulator command handler [Dominik]
  - use 'notice' log level for duplicate disarm-when-not-armed or
    arm-when-not-disarmed, previously one was warn and one info.
  - add tests: ignore with manual-migrate and node recheck,
    maintenance+ignore, maintenance+ignore+manual-migrate, double
    disarm/arm idempotency, error state service during disarm
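
The disarm sequencing described in the commit message (defer on
fencing, wait for idle LRMs, only then release the CRM watchdog) can
be modeled roughly like this — a toy state model for illustration,
not the actual PVE::HA::Manager code; all names are made up:

```perl
use strict;
use warnings;

# One CRM iteration of the disarm flow: defer while fencing/recovery
# is in flight, wait until every LRM dropped its agent lock and
# watchdog, then release the CRM's own watchdog. The manager lock is
# kept throughout so a later arm-ha can still be processed.
sub crm_disarm_step {
    my ($state) = @_;
    return 'deferred' if $state->{fencing_in_progress};
    return 'waiting'  if grep { !$_->{idle} } values %{ $state->{lrms} };
    $state->{crm_watchdog} = 'released';
    return 'disarmed';
}

my $state = {
    fencing_in_progress => 0,
    lrms => {
        node1 => { idle => 1 },
        node2 => { idle => 0 },    # still finishing active workers
    },
    crm_watchdog => 'held',
};

print crm_disarm_step($state), "\n";    # waiting
$state->{lrms}{node2}{idle} = 1;
print crm_disarm_step($state), "\n";    # disarmed
```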

 src/PVE/HA/CRM.pm                             |  33 ++-
 src/PVE/HA/Config.pm                          |   5 +
 src/PVE/HA/LRM.pm                             |  31 ++-
 src/PVE/HA/Manager.pm                         | 197 ++++++++++++++++--
 src/PVE/HA/Sim/Hardware.pm                    |  24 ++-
 src/test/test-disarm-crm-stop1/README         |  13 ++
 src/test/test-disarm-crm-stop1/cmdlist        |   6 +
 .../test-disarm-crm-stop1/hardware_status     |   5 +
 src/test/test-disarm-crm-stop1/log.expect     |  66 ++++++
 src/test/test-disarm-crm-stop1/manager_status |   1 +
 src/test/test-disarm-crm-stop1/service_config |   5 +
 src/test/test-disarm-double1/cmdlist          |   7 +
 src/test/test-disarm-double1/hardware_status  |   5 +
 src/test/test-disarm-double1/log.expect       |  53 +++++
 src/test/test-disarm-double1/manager_status   |   1 +
 src/test/test-disarm-double1/service_config   |   4 +
 src/test/test-disarm-failing-service1/cmdlist |   6 +
 .../hardware_status                           |   5 +
 .../test-disarm-failing-service1/log.expect   | 125 +++++++++++
 .../manager_status                            |   1 +
 .../service_config                            |   4 +
 src/test/test-disarm-fence1/cmdlist           |   9 +
 src/test/test-disarm-fence1/hardware_status   |   5 +
 src/test/test-disarm-fence1/log.expect        |  78 +++++++
 src/test/test-disarm-fence1/manager_status    |   1 +
 src/test/test-disarm-fence1/service_config    |   5 +
 src/test/test-disarm-frozen1/README           |  10 +
 src/test/test-disarm-frozen1/cmdlist          |   5 +
 src/test/test-disarm-frozen1/hardware_status  |   5 +
 src/test/test-disarm-frozen1/log.expect       |  59 ++++++
 src/test/test-disarm-frozen1/manager_status   |   1 +
 src/test/test-disarm-frozen1/service_config   |   5 +
 src/test/test-disarm-ignored1/README          |  10 +
 src/test/test-disarm-ignored1/cmdlist         |   5 +
 src/test/test-disarm-ignored1/hardware_status |   5 +
 src/test/test-disarm-ignored1/log.expect      |  50 +++++
 src/test/test-disarm-ignored1/manager_status  |   1 +
 src/test/test-disarm-ignored1/service_config  |   5 +
 src/test/test-disarm-ignored2/cmdlist         |   6 +
 src/test/test-disarm-ignored2/hardware_status |   5 +
 src/test/test-disarm-ignored2/log.expect      |  60 ++++++
 src/test/test-disarm-ignored2/manager_status  |   1 +
 src/test/test-disarm-ignored2/service_config  |   5 +
 src/test/test-disarm-maintenance1/cmdlist     |   7 +
 .../test-disarm-maintenance1/hardware_status  |   5 +
 src/test/test-disarm-maintenance1/log.expect  |  79 +++++++
 .../test-disarm-maintenance1/manager_status   |   1 +
 .../test-disarm-maintenance1/service_config   |   5 +
 src/test/test-disarm-maintenance2/cmdlist     |   7 +
 .../test-disarm-maintenance2/hardware_status  |   5 +
 src/test/test-disarm-maintenance2/log.expect  |  78 +++++++
 .../test-disarm-maintenance2/manager_status   |   1 +
 .../test-disarm-maintenance2/service_config   |   5 +
 src/test/test-disarm-maintenance3/cmdlist     |   8 +
 .../test-disarm-maintenance3/hardware_status  |   5 +
 src/test/test-disarm-maintenance3/log.expect  |  80 +++++++
 .../test-disarm-maintenance3/manager_status   |   1 +
 .../test-disarm-maintenance3/service_config   |   5 +
 src/test/test-disarm-relocate1/README         |   3 +
 src/test/test-disarm-relocate1/cmdlist        |   7 +
 .../test-disarm-relocate1/hardware_status     |   5 +
 src/test/test-disarm-relocate1/log.expect     |  51 +++++
 src/test/test-disarm-relocate1/manager_status |   1 +
 src/test/test-disarm-relocate1/service_config |   4 +
 64 files changed, 1264 insertions(+), 32 deletions(-)
 create mode 100644 src/test/test-disarm-crm-stop1/README
 create mode 100644 src/test/test-disarm-crm-stop1/cmdlist
 create mode 100644 src/test/test-disarm-crm-stop1/hardware_status
 create mode 100644 src/test/test-disarm-crm-stop1/log.expect
 create mode 100644 src/test/test-disarm-crm-stop1/manager_status
 create mode 100644 src/test/test-disarm-crm-stop1/service_config
 create mode 100644 src/test/test-disarm-double1/cmdlist
 create mode 100644 src/test/test-disarm-double1/hardware_status
 create mode 100644 src/test/test-disarm-double1/log.expect
 create mode 100644 src/test/test-disarm-double1/manager_status
 create mode 100644 src/test/test-disarm-double1/service_config
 create mode 100644 src/test/test-disarm-failing-service1/cmdlist
 create mode 100644 src/test/test-disarm-failing-service1/hardware_status
 create mode 100644 src/test/test-disarm-failing-service1/log.expect
 create mode 100644 src/test/test-disarm-failing-service1/manager_status
 create mode 100644 src/test/test-disarm-failing-service1/service_config
 create mode 100644 src/test/test-disarm-fence1/cmdlist
 create mode 100644 src/test/test-disarm-fence1/hardware_status
 create mode 100644 src/test/test-disarm-fence1/log.expect
 create mode 100644 src/test/test-disarm-fence1/manager_status
 create mode 100644 src/test/test-disarm-fence1/service_config
 create mode 100644 src/test/test-disarm-frozen1/README
 create mode 100644 src/test/test-disarm-frozen1/cmdlist
 create mode 100644 src/test/test-disarm-frozen1/hardware_status
 create mode 100644 src/test/test-disarm-frozen1/log.expect
 create mode 100644 src/test/test-disarm-frozen1/manager_status
 create mode 100644 src/test/test-disarm-frozen1/service_config
 create mode 100644 src/test/test-disarm-ignored1/README
 create mode 100644 src/test/test-disarm-ignored1/cmdlist
 create mode 100644 src/test/test-disarm-ignored1/hardware_status
 create mode 100644 src/test/test-disarm-ignored1/log.expect
 create mode 100644 src/test/test-disarm-ignored1/manager_status
 create mode 100644 src/test/test-disarm-ignored1/service_config
 create mode 100644 src/test/test-disarm-ignored2/cmdlist
 create mode 100644 src/test/test-disarm-ignored2/hardware_status
 create mode 100644 src/test/test-disarm-ignored2/log.expect
 create mode 100644 src/test/test-disarm-ignored2/manager_status
 create mode 100644 src/test/test-disarm-ignored2/service_config
 create mode 100644 src/test/test-disarm-maintenance1/cmdlist
 create mode 100644 src/test/test-disarm-maintenance1/hardware_status
 create mode 100644 src/test/test-disarm-maintenance1/log.expect
 create mode 100644 src/test/test-disarm-maintenance1/manager_status
 create mode 100644 src/test/test-disarm-maintenance1/service_config
 create mode 100644 src/test/test-disarm-maintenance2/cmdlist
 create mode 100644 src/test/test-disarm-maintenance2/hardware_status
 create mode 100644 src/test/test-disarm-maintenance2/log.expect
 create mode 100644 src/test/test-disarm-maintenance2/manager_status
 create mode 100644 src/test/test-disarm-maintenance2/service_config
 create mode 100644 src/test/test-disarm-maintenance3/cmdlist
 create mode 100644 src/test/test-disarm-maintenance3/hardware_status
 create mode 100644 src/test/test-disarm-maintenance3/log.expect
 create mode 100644 src/test/test-disarm-maintenance3/manager_status
 create mode 100644 src/test/test-disarm-maintenance3/service_config
 create mode 100644 src/test/test-disarm-relocate1/README
 create mode 100644 src/test/test-disarm-relocate1/cmdlist
 create mode 100644 src/test/test-disarm-relocate1/hardware_status
 create mode 100644 src/test/test-disarm-relocate1/log.expect
 create mode 100644 src/test/test-disarm-relocate1/manager_status
 create mode 100644 src/test/test-disarm-relocate1/service_config

diff --git a/src/PVE/HA/CRM.pm b/src/PVE/HA/CRM.pm
index 2739763..a76cf67 100644
--- a/src/PVE/HA/CRM.pm
+++ b/src/PVE/HA/CRM.pm
@@ -104,9 +104,17 @@ sub get_protected_ha_manager_lock {
         if ($haenv->get_ha_manager_lock()) {
             if ($self->{ha_manager_wd}) {
                 $haenv->watchdog_update($self->{ha_manager_wd});
-            } else {
-                my $wfh = $haenv->watchdog_open();
-                $self->{ha_manager_wd} = $wfh;
+            } elsif (!$self->{disarmed}) {
+                # check on-disk disarm state to avoid briefly opening a watchdog when taking
+                # over as new master while the stack is already fully disarmed
+                my $ms = eval { $haenv->read_manager_status() };
+                if ($ms && $ms->{disarm} && $ms->{disarm}->{state} eq 'disarmed') {
+                    $haenv->log('info', "taking over as disarmed master, skipping watchdog");
+                    $self->{disarmed} = 1;
+                } else {
+                    my $wfh = $haenv->watchdog_open();
+                    $self->{ha_manager_wd} = $wfh;
+                }
             }
             return 1;
         }
@@ -211,6 +219,10 @@ sub can_get_active {
                 if (scalar($ss->%*)) {
                     return 1; # need to get active to clean up stale service status entries
                 }
+
+                if ($manager_status->{disarm}) {
+                    return 1; # stay active while HA stack is disarmed
+                }
             }
             return 0; # no services, no node in maintenance mode, and no crm cmds -> can stay idle
         }
@@ -232,6 +244,9 @@ sub allowed_to_get_idle {
     my $manager_status = get_manager_status_guarded($haenv);
     return 0 if !$self->is_cluster_and_ha_healthy($manager_status);
 
+    # don't go idle while the HA stack is disarmed - we must stay active to process arm-ha
+    return 0 if $manager_status->{disarm};
+
     my $conf = eval { $haenv->read_service_config() };
     if (my $err = $@) {
         $haenv->log('err', "could not read service config: $err");
@@ -379,6 +394,18 @@ sub work {
                 }
 
                 $manager->manage();
+
+                if ($manager->is_fully_disarmed()) {
+                    if (!$self->{disarmed}) {
+                        $haenv->log('info', "HA stack fully disarmed, releasing CRM watchdog");
+                        give_up_watchdog_protection($self);
+                        $self->{disarmed} = 1;
+                    }
+                } elsif ($self->{disarmed}) {
+                    $haenv->log('info', "re-arming HA stack");
+                    $self->{disarmed} = 0;
+                    # watchdog will be re-opened by get_protected_ha_manager_lock next iteration
+                }
             }
         };
         if (my $err = $@) {
diff --git a/src/PVE/HA/Config.pm b/src/PVE/HA/Config.pm
index 19eec2a..ad7f8a4 100644
--- a/src/PVE/HA/Config.pm
+++ b/src/PVE/HA/Config.pm
@@ -362,6 +362,11 @@ my $service_check_ha_state = sub {
         if (!defined($has_state)) {
             # ignored service behave as if they were not managed by HA
             return 0 if defined($d->{state}) && $d->{state} eq 'ignored';
+            # cluster-wide disarm with ignore mode - resources can be managed directly
+            my $ms = cfs_read_file($manager_status_filename);
+            if (my $disarm = $ms->{disarm}) {
+                return 0 if $disarm->{mode} eq 'ignore';
+            }
             return 1;
         }
 
diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 09a965c..9545018 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -37,7 +37,7 @@ sub new {
         restart_tries => {},
         shutdown_request => 0,
         shutdown_errors => 0,
-        # mode can be: active, reboot, shutdown, restart, maintenance
+        # mode can be: active, reboot, shutdown, restart, maintenance, disarm
         mode => 'active',
         cluster_state_update => 0,
         active_idle_rounds => 0,
@@ -212,7 +212,9 @@ sub update_service_status {
             my $request = $ms->{node_request}->{$nodename} // {};
             if ($request->{maintenance}) {
                 $self->{mode} = 'maintenance';
-            } elsif ($self->{mode} eq 'maintenance') {
+            } elsif ($ms->{disarm}) {
+                $self->{mode} = 'disarm';
+            } elsif ($self->{mode} eq 'maintenance' || $self->{mode} eq 'disarm') {
                 $self->{mode} = 'active';
             }
         }
@@ -359,7 +361,9 @@ sub work {
 
         my $service_count = $self->active_service_count();
 
-        if (!$fence_request && $service_count && $haenv->quorate()) {
+        if ($self->{mode} eq 'disarm') {
+            # stay idle while disarmed, don't acquire lock
+        } elsif (!$fence_request && $service_count && $haenv->quorate()) {
             if ($self->get_protected_ha_agent_lock()) {
                 $self->set_local_status({ state => 'active' });
             }
@@ -382,6 +386,13 @@ sub work {
             $self->set_local_status({ state => 'lost_agent_lock' });
         } elsif ($self->is_maintenance_requested()) {
             $self->set_local_status({ state => 'maintenance' });
+        } elsif ($self->{mode} eq 'disarm' && !$self->run_workers()) {
+            $haenv->log('info', "HA disarm requested, releasing agent lock and watchdog");
+            # safety: disarming requested, no fence request (handled in earlier if-branch) and no
+            # running workers anymore, so safe to go idle.
+            $haenv->release_ha_agent_lock();
+            give_up_watchdog_protection($self);
+            $self->set_local_status({ state => 'wait_for_agent_lock' });
         } else {
             if (!$self->has_configured_service_on_local_node() && !$self->run_workers()) {
                 # no active service configured for this node and all (old) workers are done
@@ -409,6 +420,20 @@ sub work {
                 "node needs to be fenced during maintenance mode - releasing agent_lock\n",
             );
             $self->set_local_status({ state => 'lost_agent_lock' });
+        } elsif (
+            $self->{mode} eq 'disarm'
+            && !$self->active_service_count()
+            && !$self->run_workers()
+        ) {
+            # disarm takes priority - release lock and watchdog, go idle
+            $haenv->log(
+                'info',
+                "HA disarm requested during maintenance, releasing agent lock and watchdog",
+            );
+            # safety: no active services and no running workers, so safe to go idle.
+            $haenv->release_ha_agent_lock();
+            give_up_watchdog_protection($self);
+            $self->set_local_status({ state => 'wait_for_agent_lock' });
         } elsif ($self->active_service_count() || $self->run_workers()) {
             # keep the lock and watchdog as long as not all services cleared the node
             if (!$self->get_protected_ha_agent_lock()) {
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index b1dbe6a..aa29858 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -75,6 +75,9 @@ sub new {
     # on change of active master.
     $self->{ms}->{node_request} = $old_ms->{node_request} if defined($old_ms->{node_request});
 
+    # preserve disarm state across CRM master changes
+    $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
+
     $self->update_crs_scheduler_mode(); # initial set, we update it once every loop
 
     return $self;
@@ -472,7 +475,12 @@ sub update_crm_commands {
             my $node = $1;
 
             my $state = $ns->get_node_state($node);
-            if ($state eq 'online') {
+            if ($ms->{disarm}) {
+                $haenv->log(
+                    'warn',
+                    "ignoring maintenance command for node $node - HA stack is disarmed",
+                );
+            } elsif ($state eq 'online') {
                 $ms->{node_request}->{$node}->{maintenance} = 1;
             } elsif ($state eq 'maintenance') {
                 $haenv->log(
@@ -493,6 +501,51 @@ sub update_crm_commands {
                 );
             }
             delete $ms->{node_request}->{$node}->{maintenance}; # gets flushed out at the end of the CRM loop
+        } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
+            my $mode = $1;
+
+            if ($ms->{disarm}) {
+                $haenv->log(
+                    'notice',
+                    "ignoring disarm-ha command - already in disarm state ($ms->{disarm}->{state})",
+                );
+            } else {
+                $haenv->log('info', "got crm command: disarm-ha $mode");
+                if ($mode eq 'ignore') {
+                    for my $sid (sort keys %$ss) {
+                        $haenv->log(
+                            'info', "disarm: suspending HA tracking for service '$sid'",
+                        );
+                    }
+                }
+                $ms->{disarm} = { mode => $mode, state => 'disarming' };
+            }
+        } elsif ($cmd =~ m/^arm-ha$/) {
+            if ($ms->{disarm}) {
+                $haenv->log('info', "got crm command: arm-ha");
+
+                # recheck node info after ignore mode, as services may have been manually
+                # migrated while HA tracking was suspended
+                if ($ms->{disarm}->{mode} eq 'ignore') {
+                    my $sc = $haenv->read_service_config();
+                    for my $sid (sort keys %$ss) {
+                        my $cd = $sc->{$sid};
+                        next if !$cd;
+                        next if $cd->{node} eq $ss->{$sid}->{node};
+                        $haenv->log(
+                            'info',
+                            "service '$sid': updating node"
+                                . " '$ss->{$sid}->{node}' => '$cd->{node}'"
+                                . " (changed while disarmed)",
+                        );
+                        $ss->{$sid}->{node} = $cd->{node};
+                    }
+                }
+
+                delete $ms->{disarm};
+            } else {
+                $haenv->log('notice', "ignoring arm-ha command - HA stack is not disarmed");
+            }
         } else {
             $haenv->log('err', "unable to parse crm command: $cmd");
         }
@@ -631,6 +684,94 @@ sub try_persistent_group_migration {
     }
 }
 
+sub handle_disarm {
+    my ($self, $disarm, $ss, $lrm_modes) = @_;
+
+    my $haenv = $self->{haenv};
+    my $ns = $self->{ns};
+
+    # defer disarm if any service is in a transient state that the state machine must first resolve
+    for my $sid (sort keys %$ss) {
+        my $state = $ss->{$sid}->{state};
+        if ($state eq 'fence' || $state eq 'recovery') {
+            $haenv->log(
+                'warn', "deferring disarm - service '$sid' is in '$state' state",
+            );
+            return 0; # let manage() continue so fence/recovery can progress
+        }
+        if ($state eq 'migrate' || $state eq 'relocate') {
+            $haenv->log(
+                'info', "deferring disarm - service '$sid' is in '$state' state",
+            );
+            return 0; # let manage() continue so migration can complete
+        }
+    }
+
+    my $mode = $disarm->{mode};
+
+    # prune stale runtime data (failed_nodes, cmd, target, ...) so the state machine starts
+    # fresh on re-arm; preserve maintenance_node for correct return behavior
+    my %keep_keys = map { $_ => 1 } qw(state node uid maintenance_node);
+
+    if ($mode eq 'freeze') {
+        for my $sid (sort keys %$ss) {
+            my $sd = $ss->{$sid};
+            my $state = $sd->{state};
+            next if $state eq 'freeze'; # already frozen
+            if (
+                $state eq 'started'
+                || $state eq 'stopped'
+                || $state eq 'request_stop'
+                || $state eq 'request_start'
+                || $state eq 'request_start_balance'
+                || $state eq 'error'
+            ) {
+                $haenv->log('info', "disarm: freezing service '$sid' (was '$state')");
+                delete $sd->{$_} for grep { !$keep_keys{$_} } keys %$sd;
+                $sd->{state} = 'freeze';
+                $sd->{uid} = compute_new_uuid('freeze');
+            }
+        }
+    } elsif ($mode eq 'ignore') {
+        # keep $ss intact; the disarm flag in $ms causes service loops and vm_is_ha_managed()
+        # to skip these services while disarmed
+        for my $sid (sort keys %$ss) {
+            my $sd = $ss->{$sid};
+            delete $sd->{$_} for grep { !$keep_keys{$_} } keys %$sd;
+        }
+    }
+
+    # check if all online LRMs have entered disarm mode
+    my $all_disarmed = 1;
+    my $online_nodes = $ns->list_online_nodes();
+
+    for my $node (@$online_nodes) {
+        my $lrm_mode = $lrm_modes->{$node} // 'unknown';
+        if ($lrm_mode ne 'disarm') {
+            $all_disarmed = 0;
+            last;
+        }
+    }
+
+    if ($all_disarmed && $disarm->{state} ne 'disarmed') {
+        $haenv->log('info', "all LRMs disarmed, HA stack is now fully disarmed");
+        $disarm->{state} = 'disarmed';
+    }
+
+    # once disarmed, stay disarmed - a returning node's LRM will catch up within one cycle
+    $self->{all_lrms_disarmed} = $disarm->{state} eq 'disarmed';
+
+    $self->flush_master_status();
+
+    return 1;
+}
+
+sub is_fully_disarmed {
+    my ($self) = @_;
+
+    return $self->{all_lrms_disarmed};
+}
+
 sub manage {
     my ($self) = @_;
 
@@ -657,31 +798,34 @@ sub manage {
 
     # compute new service status
 
-    # add new service
-    foreach my $sid (sort keys %$sc) {
-        next if $ss->{$sid}; # already there
-        my $cd = $sc->{$sid};
-        next if $cd->{state} eq 'ignored';
+    # skip service add/remove when disarmed - handle_disarm manages service status
+    if (!$ms->{disarm}) {
+        # add new service
+        foreach my $sid (sort keys %$sc) {
+            next if $ss->{$sid}; # already there
+            my $cd = $sc->{$sid};
+            next if $cd->{state} eq 'ignored';
 
-        $haenv->log('info', "adding new service '$sid' on node '$cd->{node}'");
-        # assume we are running to avoid relocate running service at add
-        my $state = ($cd->{state} eq 'started') ? 'request_start' : 'request_stop';
-        $ss->{$sid} = {
-            state => $state,
-            node => $cd->{node},
-            uid => compute_new_uuid('started'),
-        };
-    }
+            $haenv->log('info', "adding new service '$sid' on node '$cd->{node}'");
+            # assume we are running to avoid relocate running service at add
+            my $state = ($cd->{state} eq 'started') ? 'request_start' : 'request_stop';
+            $ss->{$sid} = {
+                state => $state,
+                node => $cd->{node},
+                uid => compute_new_uuid('started'),
+            };
+        }
 
-    # remove stale or ignored services from manager state
-    foreach my $sid (keys %$ss) {
-        next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
+        # remove stale or ignored services from manager state
+        foreach my $sid (keys %$ss) {
+            next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
 
-        my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
-        $haenv->log('info', "removing stale service '$sid' ($reason)");
+            my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
+            $haenv->log('info', "removing stale service '$sid' ($reason)");
 
-        # remove all service related state information
-        delete $ss->{$sid};
+            # remove all service related state information
+            delete $ss->{$sid};
+        }
     }
 
     $self->recompute_online_node_usage();
@@ -713,6 +857,15 @@ sub manage {
 
     $self->update_crm_commands();
 
+    if (my $disarm = $ms->{disarm}) {
+        if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
+            return; # disarm active and progressing, skip normal service state machine
+        }
+        # disarm deferred (e.g. due to active fencing) - fall through to let it complete
+    }
+
+    $self->{all_lrms_disarmed} = 0;
+
     for (;;) {
         my $repeat = 0;
 
diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
index 301c391..f6857bd 100644
--- a/src/PVE/HA/Sim/Hardware.pm
+++ b/src/PVE/HA/Sim/Hardware.pm
@@ -835,6 +835,15 @@ sub sim_hardware_cmd {
                 || $action eq 'disable-node-maintenance'
             ) {
                 $self->queue_crm_commands_nolock("$action $node");
+            } elsif ($action eq 'disarm-ha') {
+                my $mode = $params[0];
+
+                die "sim_hardware_cmd: unknown resource mode '$mode'\n"
+                    if $mode !~ m/^(freeze|ignore)$/;
+
+                $self->queue_crm_commands_nolock("disarm-ha $mode");
+            } elsif ($action eq 'arm-ha') {
+                $self->queue_crm_commands_nolock("arm-ha");
             } else {
                 die "sim_hardware_cmd: unknown action '$action'";
             }
@@ -892,10 +901,17 @@ sub sim_hardware_cmd {
                 my $current_node = $conf->{$sid}->{node}
                     || die "sim_hardware_cmd: service '$sid' has no node\n";
 
-                die "sim_hardware_cmd: manual-migrate requires service"
-                    . " in 'ignored' state\n"
-                    if !defined($conf->{$sid}->{state})
-                    || $conf->{$sid}->{state} ne 'ignored';
+                my $svc_ignored = defined($conf->{$sid}->{state})
+                    && $conf->{$sid}->{state} eq 'ignored';
+
+                my $ms = PVE::HA::Tools::read_json_from_file(
+                    "$self->{statusdir}/manager_status", {},
+                );
+                my $disarm_ignored = $ms->{disarm} && $ms->{disarm}->{mode} eq 'ignore';
+
+                die "sim_hardware_cmd: manual-migrate requires service in"
+                    . " 'ignored' state or disarm-ha ignore mode\n"
+                    if !$svc_ignored && !$disarm_ignored;
 
                 $self->change_service_location($sid, $current_node, $param);
 
diff --git a/src/test/test-disarm-crm-stop1/README b/src/test/test-disarm-crm-stop1/README
new file mode 100644
index 0000000..5f81497
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/README
@@ -0,0 +1,13 @@
+Test CRM master takeover while HA stack is fully disarmed.
+
+Verify that when the CRM master is stopped during full disarm, a slave
+takes over cleanly without briefly opening a watchdog, and that arming
+via the new master works correctly.
+
+1. Start 3 nodes with services
+2. Disarm HA with freeze resource mode
+3. Wait for full disarm (all LRMs disarmed, CRM watchdog released)
+4. Stop CRM on master node (node1)
+5. Slave on node2 takes over as new master, preserving disarm state
+6. Arm HA via new master
+7. Services unfreeze and resume normal operation
diff --git a/src/test/test-disarm-crm-stop1/cmdlist b/src/test/test-disarm-crm-stop1/cmdlist
new file mode 100644
index 0000000..01e9cd9
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/cmdlist
@@ -0,0 +1,6 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node1 disarm-ha freeze" ],
+    [ "crm node1 stop" ],
+    [ "crm node2 arm-ha" ]
+]
\ No newline at end of file
diff --git a/src/test/test-disarm-crm-stop1/hardware_status b/src/test/test-disarm-crm-stop1/hardware_status
new file mode 100644
index 0000000..4990fd0
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
\ No newline at end of file
diff --git a/src/test/test-disarm-crm-stop1/log.expect b/src/test/test-disarm-crm-stop1/log.expect
new file mode 100644
index 0000000..880008f
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/log.expect
@@ -0,0 +1,66 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     40    node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info     40    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    120      cmdlist: execute crm node1 disarm-ha freeze
+info    120    node1/crm: got crm command: disarm-ha freeze
+info    120    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    120    node1/crm: disarm: freezing service 'vm:102' (was 'stopped')
+info    120    node1/crm: disarm: freezing service 'vm:103' (was 'stopped')
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    125    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info    125    node3/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    220      cmdlist: execute crm node1 stop
+info    220    node1/crm: server received shutdown request
+info    220    node1/crm: voluntary release CRM lock
+info    221    node1/crm: exit (loop end)
+info    222    node2/crm: got lock 'ha_manager_lock'
+info    222    node2/crm: taking over as disarmed master, skipping watchdog
+info    222    node2/crm: status change slave => master
+info    320      cmdlist: execute crm node2 arm-ha
+info    321    node2/crm: got crm command: arm-ha
+info    321    node2/crm: re-arming HA stack
+info    341    node2/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    341    node2/crm: service 'vm:102': state changed from 'freeze' to 'request_stop'
+info    341    node2/crm: service 'vm:103': state changed from 'freeze' to 'request_stop'
+info    342    node2/lrm: got lock 'ha_agent_node2_lock'
+info    342    node2/lrm: status change wait_for_agent_lock => active
+info    344    node3/lrm: got lock 'ha_agent_node3_lock'
+info    344    node3/lrm: status change wait_for_agent_lock => active
+info    360    node1/lrm: got lock 'ha_agent_node1_lock'
+info    360    node1/lrm: status change wait_for_agent_lock => active
+info    361    node2/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info    361    node2/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    920     hardware: exit simulation - done
diff --git a/src/test/test-disarm-crm-stop1/manager_status b/src/test/test-disarm-crm-stop1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-crm-stop1/service_config b/src/test/test-disarm-crm-stop1/service_config
new file mode 100644
index 0000000..c2ddbce
--- /dev/null
+++ b/src/test/test-disarm-crm-stop1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "stopped" },
+    "vm:103": { "node": "node3", "state": "disabled" }
+}
\ No newline at end of file
diff --git a/src/test/test-disarm-double1/cmdlist b/src/test/test-disarm-double1/cmdlist
new file mode 100644
index 0000000..36ecb69
--- /dev/null
+++ b/src/test/test-disarm-double1/cmdlist
@@ -0,0 +1,7 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node1 disarm-ha freeze" ],
+    [ "crm node1 disarm-ha ignore" ],
+    [ "crm node1 arm-ha" ],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-double1/hardware_status b/src/test/test-disarm-double1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-double1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-double1/log.expect b/src/test/test-disarm-double1/log.expect
new file mode 100644
index 0000000..c45f586
--- /dev/null
+++ b/src/test/test-disarm-double1/log.expect
@@ -0,0 +1,53 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info    120      cmdlist: execute crm node1 disarm-ha freeze
+info    120    node1/crm: got crm command: disarm-ha freeze
+info    120    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    120    node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    220      cmdlist: execute crm node1 disarm-ha ignore
+noti    220    node1/crm: ignoring disarm-ha command - already in disarm state (disarmed)
+info    320      cmdlist: execute crm node1 arm-ha
+info    320    node1/crm: got crm command: arm-ha
+info    320    node1/crm: re-arming HA stack
+info    340    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    340    node1/crm: service 'vm:102': state changed from 'freeze' to 'started'
+info    341    node1/lrm: got lock 'ha_agent_node1_lock'
+info    341    node1/lrm: status change wait_for_agent_lock => active
+info    343    node2/lrm: got lock 'ha_agent_node2_lock'
+info    343    node2/lrm: status change wait_for_agent_lock => active
+info    420      cmdlist: execute crm node1 arm-ha
+noti    420    node1/crm: ignoring arm-ha command - HA stack is not disarmed
+info   1020     hardware: exit simulation - done
diff --git a/src/test/test-disarm-double1/manager_status b/src/test/test-disarm-double1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-double1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-double1/service_config b/src/test/test-disarm-double1/service_config
new file mode 100644
index 0000000..0336d09
--- /dev/null
+++ b/src/test/test-disarm-double1/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" }
+}
diff --git a/src/test/test-disarm-failing-service1/cmdlist b/src/test/test-disarm-failing-service1/cmdlist
new file mode 100644
index 0000000..97ae018
--- /dev/null
+++ b/src/test/test-disarm-failing-service1/cmdlist
@@ -0,0 +1,6 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "service fa:1001 disabled" ],
+    [ "crm node1 disarm-ha freeze" ],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-failing-service1/hardware_status b/src/test/test-disarm-failing-service1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-failing-service1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-failing-service1/log.expect b/src/test/test-disarm-failing-service1/log.expect
new file mode 100644
index 0000000..eddf8fe
--- /dev/null
+++ b/src/test/test-disarm-failing-service1/log.expect
@@ -0,0 +1,125 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'fa:1001' on node 'node2'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: service 'fa:1001': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service fa:1001
+info     23    node2/lrm: service status fa:1001 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info    120      cmdlist: execute service fa:1001 disabled
+info    120    node1/crm: service 'fa:1001': state changed from 'started' to 'request_stop'
+info    123    node2/lrm: stopping service fa:1001
+info    123    node2/lrm: unable to stop service fa:1001 (still running)
+err     140    node1/crm: service 'fa:1001' stop failed (exit code 1)
+info    140    node1/crm: service 'fa:1001': state changed from 'request_stop' to 'error'
+info    140    node1/crm: service 'fa:1001': state changed from 'error' to 'stopped'
+info    143    node2/lrm: stopping service fa:1001
+info    143    node2/lrm: unable to stop service fa:1001 (still running)
+info    163    node2/lrm: stopping service fa:1001
+info    163    node2/lrm: unable to stop service fa:1001 (still running)
+info    183    node2/lrm: stopping service fa:1001
+info    183    node2/lrm: unable to stop service fa:1001 (still running)
+info    203    node2/lrm: stopping service fa:1001
+info    203    node2/lrm: unable to stop service fa:1001 (still running)
+info    220      cmdlist: execute crm node1 disarm-ha freeze
+info    220    node1/crm: got crm command: disarm-ha freeze
+info    220    node1/crm: disarm: freezing service 'fa:1001' (was 'stopped')
+info    220    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    221    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    221    node1/lrm: status change active => wait_for_agent_lock
+info    223    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    223    node2/lrm: status change active => wait_for_agent_lock
+info    240    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    240    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    320      cmdlist: execute crm node1 arm-ha
+info    320    node1/crm: got crm command: arm-ha
+info    320    node1/crm: re-arming HA stack
+info    340    node1/crm: service 'fa:1001': state changed from 'freeze' to 'request_stop'
+info    340    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    341    node1/lrm: got lock 'ha_agent_node1_lock'
+info    341    node1/lrm: status change wait_for_agent_lock => active
+info    343    node2/lrm: got lock 'ha_agent_node2_lock'
+info    343    node2/lrm: status change wait_for_agent_lock => active
+info    343    node2/lrm: stopping service fa:1001
+info    343    node2/lrm: unable to stop service fa:1001 (still running)
+err     360    node1/crm: service 'fa:1001' stop failed (exit code 1)
+info    360    node1/crm: service 'fa:1001': state changed from 'request_stop' to 'error'
+info    360    node1/crm: service 'fa:1001': state changed from 'error' to 'stopped'
+info    363    node2/lrm: stopping service fa:1001
+info    363    node2/lrm: unable to stop service fa:1001 (still running)
+info    383    node2/lrm: stopping service fa:1001
+info    383    node2/lrm: unable to stop service fa:1001 (still running)
+info    403    node2/lrm: stopping service fa:1001
+info    403    node2/lrm: unable to stop service fa:1001 (still running)
+info    423    node2/lrm: stopping service fa:1001
+info    423    node2/lrm: unable to stop service fa:1001 (still running)
+info    443    node2/lrm: stopping service fa:1001
+info    443    node2/lrm: unable to stop service fa:1001 (still running)
+info    463    node2/lrm: stopping service fa:1001
+info    463    node2/lrm: unable to stop service fa:1001 (still running)
+info    483    node2/lrm: stopping service fa:1001
+info    483    node2/lrm: unable to stop service fa:1001 (still running)
+info    503    node2/lrm: stopping service fa:1001
+info    503    node2/lrm: unable to stop service fa:1001 (still running)
+info    523    node2/lrm: stopping service fa:1001
+info    523    node2/lrm: unable to stop service fa:1001 (still running)
+info    543    node2/lrm: stopping service fa:1001
+info    543    node2/lrm: unable to stop service fa:1001 (still running)
+info    563    node2/lrm: stopping service fa:1001
+info    563    node2/lrm: unable to stop service fa:1001 (still running)
+info    583    node2/lrm: stopping service fa:1001
+info    583    node2/lrm: unable to stop service fa:1001 (still running)
+info    603    node2/lrm: stopping service fa:1001
+info    603    node2/lrm: unable to stop service fa:1001 (still running)
+info    623    node2/lrm: stopping service fa:1001
+info    623    node2/lrm: unable to stop service fa:1001 (still running)
+info    643    node2/lrm: stopping service fa:1001
+info    643    node2/lrm: unable to stop service fa:1001 (still running)
+info    663    node2/lrm: stopping service fa:1001
+info    663    node2/lrm: unable to stop service fa:1001 (still running)
+info    683    node2/lrm: stopping service fa:1001
+info    683    node2/lrm: unable to stop service fa:1001 (still running)
+info    703    node2/lrm: stopping service fa:1001
+info    703    node2/lrm: unable to stop service fa:1001 (still running)
+info    723    node2/lrm: stopping service fa:1001
+info    723    node2/lrm: unable to stop service fa:1001 (still running)
+info    743    node2/lrm: stopping service fa:1001
+info    743    node2/lrm: unable to stop service fa:1001 (still running)
+info    763    node2/lrm: stopping service fa:1001
+info    763    node2/lrm: unable to stop service fa:1001 (still running)
+info    783    node2/lrm: stopping service fa:1001
+info    783    node2/lrm: unable to stop service fa:1001 (still running)
+info    803    node2/lrm: stopping service fa:1001
+info    803    node2/lrm: unable to stop service fa:1001 (still running)
+info    823    node2/lrm: stopping service fa:1001
+info    823    node2/lrm: unable to stop service fa:1001 (still running)
+info    843    node2/lrm: stopping service fa:1001
+info    843    node2/lrm: unable to stop service fa:1001 (still running)
+info    863    node2/lrm: stopping service fa:1001
+info    863    node2/lrm: unable to stop service fa:1001 (still running)
+info    883    node2/lrm: stopping service fa:1001
+info    883    node2/lrm: unable to stop service fa:1001 (still running)
+info    903    node2/lrm: stopping service fa:1001
+info    903    node2/lrm: unable to stop service fa:1001 (still running)
+info    920     hardware: exit simulation - done
diff --git a/src/test/test-disarm-failing-service1/manager_status b/src/test/test-disarm-failing-service1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-failing-service1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-failing-service1/service_config b/src/test/test-disarm-failing-service1/service_config
new file mode 100644
index 0000000..f4f00e0
--- /dev/null
+++ b/src/test/test-disarm-failing-service1/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "fa:1001": { "node": "node2", "state": "enabled" }
+}
diff --git a/src/test/test-disarm-fence1/cmdlist b/src/test/test-disarm-fence1/cmdlist
new file mode 100644
index 0000000..7473615
--- /dev/null
+++ b/src/test/test-disarm-fence1/cmdlist
@@ -0,0 +1,9 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "network node2 off" ],
+    [ "crm node1 disarm-ha freeze" ],
+    [],
+    [],
+    [],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-fence1/hardware_status b/src/test/test-disarm-fence1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-fence1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-fence1/log.expect b/src/test/test-disarm-fence1/log.expect
new file mode 100644
index 0000000..9a56c5d
--- /dev/null
+++ b/src/test/test-disarm-fence1/log.expect
@@ -0,0 +1,78 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     40    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    120      cmdlist: execute network node2 off
+info    120    node1/crm: node 'node2': state changed from 'online' => 'unknown'
+info    122    node2/crm: status change slave => wait_for_quorum
+info    123    node2/lrm: status change active => lost_agent_lock
+info    160    node1/crm: service 'vm:102': state changed from 'started' to 'fence'
+info    160    node1/crm: node 'node2': state changed from 'unknown' => 'fence'
+emai    160    node1/crm: FENCE: Try to fence node 'node2'
+info    164     watchdog: execute power node2 off
+info    163    node2/crm: killed by poweroff
+info    164    node2/lrm: killed by poweroff
+info    164     hardware: server 'node2' stopped by poweroff (watchdog)
+info    220      cmdlist: execute crm node1 disarm-ha freeze
+info    220    node1/crm: got crm command: disarm-ha freeze
+warn    220    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info    221    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    221    node1/lrm: status change active => wait_for_agent_lock
+info    223    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info    223    node3/lrm: status change active => wait_for_agent_lock
+warn    240    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info    240    node1/crm: got lock 'ha_agent_node2_lock'
+info    240    node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info    240    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai    240    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info    240    node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info    240    node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node3'
+info    240    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node3)
+info    260    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    260    node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info    260    node1/crm: disarm: freezing service 'vm:103' (was 'stopped')
+info    260    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    260    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    620      cmdlist: execute crm node1 arm-ha
+info    620    node1/crm: got crm command: arm-ha
+info    620    node1/crm: re-arming HA stack
+info    640    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    640    node1/crm: service 'vm:102': state changed from 'freeze' to 'started'
+info    640    node1/crm: service 'vm:103': state changed from 'freeze' to 'request_stop'
+info    641    node1/lrm: got lock 'ha_agent_node1_lock'
+info    641    node1/lrm: status change wait_for_agent_lock => active
+info    643    node3/lrm: got lock 'ha_agent_node3_lock'
+info    643    node3/lrm: status change wait_for_agent_lock => active
+info    643    node3/lrm: starting service vm:102
+info    643    node3/lrm: service status vm:102 started
+info    660    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info   1220     hardware: exit simulation - done
diff --git a/src/test/test-disarm-fence1/manager_status b/src/test/test-disarm-fence1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-fence1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-fence1/service_config b/src/test/test-disarm-fence1/service_config
new file mode 100644
index 0000000..0487834
--- /dev/null
+++ b/src/test/test-disarm-fence1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "stopped" }
+}
diff --git a/src/test/test-disarm-frozen1/README b/src/test/test-disarm-frozen1/README
new file mode 100644
index 0000000..e68ea2c
--- /dev/null
+++ b/src/test/test-disarm-frozen1/README
@@ -0,0 +1,10 @@
+Test disarm-ha with freeze resource mode.
+
+Verify the full disarm cycle:
+1. Start 3 nodes with services
+2. Disarm HA with freeze resource mode
+3. All services should transition to freeze state
+4. LRMs should release locks and watchdogs (disarm mode)
+5. CRM should release watchdog once all LRMs disarmed
+6. Arm HA again
+7. Services should unfreeze and resume normal operation
diff --git a/src/test/test-disarm-frozen1/cmdlist b/src/test/test-disarm-frozen1/cmdlist
new file mode 100644
index 0000000..e6fc192
--- /dev/null
+++ b/src/test/test-disarm-frozen1/cmdlist
@@ -0,0 +1,5 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node1 disarm-ha freeze" ],
+    [ "crm node1 arm-ha" ]
+]
\ No newline at end of file
diff --git a/src/test/test-disarm-frozen1/hardware_status b/src/test/test-disarm-frozen1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-frozen1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-frozen1/log.expect b/src/test/test-disarm-frozen1/log.expect
new file mode 100644
index 0000000..206f14e
--- /dev/null
+++ b/src/test/test-disarm-frozen1/log.expect
@@ -0,0 +1,59 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     40    node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info     40    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    120      cmdlist: execute crm node1 disarm-ha freeze
+info    120    node1/crm: got crm command: disarm-ha freeze
+info    120    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    120    node1/crm: disarm: freezing service 'vm:102' (was 'stopped')
+info    120    node1/crm: disarm: freezing service 'vm:103' (was 'stopped')
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    125    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info    125    node3/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    220      cmdlist: execute crm node1 arm-ha
+info    220    node1/crm: got crm command: arm-ha
+info    220    node1/crm: re-arming HA stack
+info    240    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    240    node1/crm: service 'vm:102': state changed from 'freeze' to 'request_stop'
+info    240    node1/crm: service 'vm:103': state changed from 'freeze' to 'request_stop'
+info    241    node1/lrm: got lock 'ha_agent_node1_lock'
+info    241    node1/lrm: status change wait_for_agent_lock => active
+info    243    node2/lrm: got lock 'ha_agent_node2_lock'
+info    243    node2/lrm: status change wait_for_agent_lock => active
+info    245    node3/lrm: got lock 'ha_agent_node3_lock'
+info    245    node3/lrm: status change wait_for_agent_lock => active
+info    260    node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info    260    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    820     hardware: exit simulation - done
diff --git a/src/test/test-disarm-frozen1/manager_status b/src/test/test-disarm-frozen1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-frozen1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-frozen1/service_config b/src/test/test-disarm-frozen1/service_config
new file mode 100644
index 0000000..c2ddbce
--- /dev/null
+++ b/src/test/test-disarm-frozen1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "stopped" },
+    "vm:103": { "node": "node3", "state": "disabled" }
+}
\ No newline at end of file
diff --git a/src/test/test-disarm-ignored1/README b/src/test/test-disarm-ignored1/README
new file mode 100644
index 0000000..12bed55
--- /dev/null
+++ b/src/test/test-disarm-ignored1/README
@@ -0,0 +1,10 @@
+Test disarm-ha with ignore resource mode.
+
+Verify the full disarm cycle with ignore resource mode:
+1. Start 3 nodes with services
+2. Disarm HA with ignore resource mode
+3. Services are suspended from HA tracking but kept in service status
+4. LRMs should release locks and watchdogs (disarm mode)
+5. CRM should release watchdog once all LRMs disarmed
+6. Arm HA again
+7. Services resume normal HA tracking with their preserved states
diff --git a/src/test/test-disarm-ignored1/cmdlist b/src/test/test-disarm-ignored1/cmdlist
new file mode 100644
index 0000000..b8a0c04
--- /dev/null
+++ b/src/test/test-disarm-ignored1/cmdlist
@@ -0,0 +1,5 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node1 disarm-ha ignore" ],
+    [ "crm node1 arm-ha" ]
+]
\ No newline at end of file
diff --git a/src/test/test-disarm-ignored1/hardware_status b/src/test/test-disarm-ignored1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-ignored1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-ignored1/log.expect b/src/test/test-disarm-ignored1/log.expect
new file mode 100644
index 0000000..2577dff
--- /dev/null
+++ b/src/test/test-disarm-ignored1/log.expect
@@ -0,0 +1,50 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     40    node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info     40    node1/crm: service 'vm:103': state changed from 'request_stop' to 'stopped'
+info    120      cmdlist: execute crm node1 disarm-ha ignore
+info    120    node1/crm: got crm command: disarm-ha ignore
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    125    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info    125    node3/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    220      cmdlist: execute crm node1 arm-ha
+info    220    node1/crm: got crm command: arm-ha
+info    220    node1/crm: re-arming HA stack
+info    221    node1/lrm: got lock 'ha_agent_node1_lock'
+info    221    node1/lrm: status change wait_for_agent_lock => active
+info    820     hardware: exit simulation - done
diff --git a/src/test/test-disarm-ignored1/manager_status b/src/test/test-disarm-ignored1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-ignored1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-ignored1/service_config b/src/test/test-disarm-ignored1/service_config
new file mode 100644
index 0000000..c2ddbce
--- /dev/null
+++ b/src/test/test-disarm-ignored1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "stopped" },
+    "vm:103": { "node": "node3", "state": "disabled" }
+}
\ No newline at end of file
diff --git a/src/test/test-disarm-ignored2/cmdlist b/src/test/test-disarm-ignored2/cmdlist
new file mode 100644
index 0000000..eecd37d
--- /dev/null
+++ b/src/test/test-disarm-ignored2/cmdlist
@@ -0,0 +1,6 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node1 disarm-ha ignore" ],
+    [ "service vm:103 manual-migrate node1" ],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-ignored2/hardware_status b/src/test/test-disarm-ignored2/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-ignored2/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-ignored2/log.expect b/src/test/test-disarm-ignored2/log.expect
new file mode 100644
index 0000000..5f37869
--- /dev/null
+++ b/src/test/test-disarm-ignored2/log.expect
@@ -0,0 +1,60 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node3)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute crm node1 disarm-ha ignore
+info    120    node1/crm: got crm command: disarm-ha ignore
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info    120    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    125    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info    125    node3/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    220      cmdlist: execute service vm:103 manual-migrate node1
+info    320      cmdlist: execute crm node1 arm-ha
+info    320    node1/crm: got crm command: arm-ha
+info    320    node1/crm: service 'vm:103': updating node 'node3' => 'node1' (changed while disarmed)
+info    320    node1/crm: re-arming HA stack
+info    321    node1/lrm: got lock 'ha_agent_node1_lock'
+info    321    node1/lrm: status change wait_for_agent_lock => active
+info    321    node1/lrm: starting service vm:103
+info    321    node1/lrm: service status vm:103 started
+info    323    node2/lrm: got lock 'ha_agent_node2_lock'
+info    323    node2/lrm: status change wait_for_agent_lock => active
+info    920     hardware: exit simulation - done
diff --git a/src/test/test-disarm-ignored2/manager_status b/src/test/test-disarm-ignored2/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-ignored2/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-ignored2/service_config b/src/test/test-disarm-ignored2/service_config
new file mode 100644
index 0000000..4b26f6b
--- /dev/null
+++ b/src/test/test-disarm-ignored2/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-maintenance1/cmdlist b/src/test/test-disarm-maintenance1/cmdlist
new file mode 100644
index 0000000..6f8a8ea
--- /dev/null
+++ b/src/test/test-disarm-maintenance1/cmdlist
@@ -0,0 +1,7 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node3 enable-node-maintenance" ],
+    [ "crm node1 disarm-ha freeze" ],
+    [ "crm node1 arm-ha" ],
+    [ "crm node3 disable-node-maintenance" ]
+]
diff --git a/src/test/test-disarm-maintenance1/hardware_status b/src/test/test-disarm-maintenance1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-maintenance1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-maintenance1/log.expect b/src/test/test-disarm-maintenance1/log.expect
new file mode 100644
index 0000000..b5e0e5b
--- /dev/null
+++ b/src/test/test-disarm-maintenance1/log.expect
@@ -0,0 +1,79 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node3)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute crm node3 enable-node-maintenance
+info    125    node3/lrm: status change active => maintenance
+info    140    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    140    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    140    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    145    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    145    node3/lrm: service vm:103 - end migrate to node 'node1'
+info    160    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    161    node1/lrm: starting service vm:103
+info    161    node1/lrm: service status vm:103 started
+info    220      cmdlist: execute crm node1 disarm-ha freeze
+info    220    node1/crm: got crm command: disarm-ha freeze
+info    220    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    220    node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info    220    node1/crm: disarm: freezing service 'vm:103' (was 'started')
+info    221    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    221    node1/lrm: status change active => wait_for_agent_lock
+info    223    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    223    node2/lrm: status change active => wait_for_agent_lock
+info    240    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    240    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    320      cmdlist: execute crm node1 arm-ha
+info    320    node1/crm: got crm command: arm-ha
+info    320    node1/crm: re-arming HA stack
+info    340    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    340    node1/crm: service 'vm:102': state changed from 'freeze' to 'started'
+info    340    node1/crm: service 'vm:103': state changed from 'freeze' to 'started'
+info    341    node1/lrm: got lock 'ha_agent_node1_lock'
+info    341    node1/lrm: status change wait_for_agent_lock => active
+info    343    node2/lrm: got lock 'ha_agent_node2_lock'
+info    343    node2/lrm: status change wait_for_agent_lock => active
+info    420      cmdlist: execute crm node3 disable-node-maintenance
+info    425    node3/lrm: got lock 'ha_agent_node3_lock'
+info    425    node3/lrm: status change maintenance => active
+info    440    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    440    node1/crm: moving service 'vm:103' back to 'node3', node came back from maintenance.
+info    440    node1/crm: migrate service 'vm:103' to node 'node3' (running)
+info    440    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node1, target = node3)
+info    441    node1/lrm: service vm:103 - start migrate to node 'node3'
+info    441    node1/lrm: service vm:103 - end migrate to node 'node3'
+info    460    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
+info    465    node3/lrm: starting service vm:103
+info    465    node3/lrm: service status vm:103 started
+info   1020     hardware: exit simulation - done
diff --git a/src/test/test-disarm-maintenance1/manager_status b/src/test/test-disarm-maintenance1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-maintenance1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-maintenance1/service_config b/src/test/test-disarm-maintenance1/service_config
new file mode 100644
index 0000000..4b26f6b
--- /dev/null
+++ b/src/test/test-disarm-maintenance1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-maintenance2/cmdlist b/src/test/test-disarm-maintenance2/cmdlist
new file mode 100644
index 0000000..2c85b4d
--- /dev/null
+++ b/src/test/test-disarm-maintenance2/cmdlist
@@ -0,0 +1,7 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node3 enable-node-maintenance" ],
+    [ "crm node1 disarm-ha ignore" ],
+    [ "crm node3 disable-node-maintenance" ],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-maintenance2/hardware_status b/src/test/test-disarm-maintenance2/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-maintenance2/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-maintenance2/log.expect b/src/test/test-disarm-maintenance2/log.expect
new file mode 100644
index 0000000..b21b72f
--- /dev/null
+++ b/src/test/test-disarm-maintenance2/log.expect
@@ -0,0 +1,78 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node3)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute crm node3 enable-node-maintenance
+info    125    node3/lrm: status change active => maintenance
+info    140    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    140    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    140    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    145    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    145    node3/lrm: service vm:103 - end migrate to node 'node1'
+info    160    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    161    node1/lrm: starting service vm:103
+info    161    node1/lrm: service status vm:103 started
+info    220      cmdlist: execute crm node1 disarm-ha ignore
+info    220    node1/crm: got crm command: disarm-ha ignore
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+info    221    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    221    node1/lrm: status change active => wait_for_agent_lock
+info    223    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    223    node2/lrm: status change active => wait_for_agent_lock
+info    240    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    240    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    320      cmdlist: execute crm node3 disable-node-maintenance
+info    325    node3/lrm: HA disarm requested during maintenance, releasing agent lock and watchdog
+info    325    node3/lrm: status change maintenance => wait_for_agent_lock
+info    340    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    420      cmdlist: execute crm node1 arm-ha
+info    420    node1/crm: got crm command: arm-ha
+info    420    node1/crm: moving service 'vm:103' back to 'node3', node came back from maintenance.
+info    420    node1/crm: migrate service 'vm:103' to node 'node3' (running)
+info    420    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node1, target = node3)
+info    420    node1/crm: re-arming HA stack
+info    421    node1/lrm: got lock 'ha_agent_node1_lock'
+info    421    node1/lrm: status change wait_for_agent_lock => active
+info    421    node1/lrm: service vm:103 - start migrate to node 'node3'
+info    421    node1/lrm: service vm:103 - end migrate to node 'node3'
+info    423    node2/lrm: got lock 'ha_agent_node2_lock'
+info    423    node2/lrm: status change wait_for_agent_lock => active
+info    425    node3/lrm: got lock 'ha_agent_node3_lock'
+info    425    node3/lrm: status change wait_for_agent_lock => active
+info    440    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
+info    445    node3/lrm: starting service vm:103
+info    445    node3/lrm: service status vm:103 started
+info   1020     hardware: exit simulation - done
diff --git a/src/test/test-disarm-maintenance2/manager_status b/src/test/test-disarm-maintenance2/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-maintenance2/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-maintenance2/service_config b/src/test/test-disarm-maintenance2/service_config
new file mode 100644
index 0000000..4b26f6b
--- /dev/null
+++ b/src/test/test-disarm-maintenance2/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-maintenance3/cmdlist b/src/test/test-disarm-maintenance3/cmdlist
new file mode 100644
index 0000000..d49095c
--- /dev/null
+++ b/src/test/test-disarm-maintenance3/cmdlist
@@ -0,0 +1,8 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "crm node3 enable-node-maintenance" ],
+    [ "crm node1 disarm-ha ignore" ],
+    [ "service vm:103 manual-migrate node2" ],
+    [ "crm node3 disable-node-maintenance" ],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-maintenance3/hardware_status b/src/test/test-disarm-maintenance3/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-maintenance3/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-maintenance3/log.expect b/src/test/test-disarm-maintenance3/log.expect
new file mode 100644
index 0000000..b26f8b8
--- /dev/null
+++ b/src/test/test-disarm-maintenance3/log.expect
@@ -0,0 +1,80 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node3)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info    120      cmdlist: execute crm node3 enable-node-maintenance
+info    125    node3/lrm: status change active => maintenance
+info    140    node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info    140    node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info    140    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node3, target = node1)
+info    145    node3/lrm: service vm:103 - start migrate to node 'node1'
+info    145    node3/lrm: service vm:103 - end migrate to node 'node1'
+info    160    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node1)
+info    161    node1/lrm: starting service vm:103
+info    161    node1/lrm: service status vm:103 started
+info    220      cmdlist: execute crm node1 disarm-ha ignore
+info    220    node1/crm: got crm command: disarm-ha ignore
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info    220    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+info    221    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    221    node1/lrm: status change active => wait_for_agent_lock
+info    223    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    223    node2/lrm: status change active => wait_for_agent_lock
+info    240    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    240    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    320      cmdlist: execute service vm:103 manual-migrate node2
+info    420      cmdlist: execute crm node3 disable-node-maintenance
+info    425    node3/lrm: HA disarm requested during maintenance, releasing agent lock and watchdog
+info    425    node3/lrm: status change maintenance => wait_for_agent_lock
+info    440    node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info    520      cmdlist: execute crm node1 arm-ha
+info    520    node1/crm: got crm command: arm-ha
+info    520    node1/crm: service 'vm:103': updating node 'node1' => 'node2' (changed while disarmed)
+info    520    node1/crm: moving service 'vm:103' back to 'node3', node came back from maintenance.
+info    520    node1/crm: migrate service 'vm:103' to node 'node3' (running)
+info    520    node1/crm: service 'vm:103': state changed from 'started' to 'migrate'  (node = node2, target = node3)
+info    520    node1/crm: re-arming HA stack
+info    521    node1/lrm: got lock 'ha_agent_node1_lock'
+info    521    node1/lrm: status change wait_for_agent_lock => active
+info    523    node2/lrm: got lock 'ha_agent_node2_lock'
+info    523    node2/lrm: status change wait_for_agent_lock => active
+info    523    node2/lrm: service vm:103 - start migrate to node 'node3'
+info    523    node2/lrm: service vm:103 - end migrate to node 'node3'
+info    525    node3/lrm: got lock 'ha_agent_node3_lock'
+info    525    node3/lrm: status change wait_for_agent_lock => active
+info    540    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node3)
+info    545    node3/lrm: starting service vm:103
+info    545    node3/lrm: service status vm:103 started
+info   1120     hardware: exit simulation - done
diff --git a/src/test/test-disarm-maintenance3/manager_status b/src/test/test-disarm-maintenance3/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-maintenance3/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-maintenance3/service_config b/src/test/test-disarm-maintenance3/service_config
new file mode 100644
index 0000000..4b26f6b
--- /dev/null
+++ b/src/test/test-disarm-maintenance3/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-relocate1/README b/src/test/test-disarm-relocate1/README
new file mode 100644
index 0000000..a5b6324
--- /dev/null
+++ b/src/test/test-disarm-relocate1/README
@@ -0,0 +1,3 @@
+Test disarm-ha freeze when a relocate command arrives in the same CRM cycle.
+The disarm takes priority: the relocate command is pre-empted and the service
+is frozen directly. After arm-ha, both services resume normally.
diff --git a/src/test/test-disarm-relocate1/cmdlist b/src/test/test-disarm-relocate1/cmdlist
new file mode 100644
index 0000000..99f2916
--- /dev/null
+++ b/src/test/test-disarm-relocate1/cmdlist
@@ -0,0 +1,7 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "service vm:101 relocate node2", "crm node1 disarm-ha freeze" ],
+    [],
+    [],
+    [ "crm node1 arm-ha" ]
+]
diff --git a/src/test/test-disarm-relocate1/hardware_status b/src/test/test-disarm-relocate1/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-disarm-relocate1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-relocate1/log.expect b/src/test/test-disarm-relocate1/log.expect
new file mode 100644
index 0000000..b051cac
--- /dev/null
+++ b/src/test/test-disarm-relocate1/log.expect
@@ -0,0 +1,51 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node2)
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     23    node2/lrm: starting service vm:102
+info     23    node2/lrm: service status vm:102 started
+info     24    node3/crm: status change wait_for_quorum => slave
+info    120      cmdlist: execute service vm:101 relocate node2
+info    120      cmdlist: execute crm node1 disarm-ha freeze
+info    120    node1/crm: got crm command: relocate vm:101 node2
+info    120    node1/crm: got crm command: disarm-ha freeze
+info    120    node1/crm: disarm: freezing service 'vm:101' (was 'started')
+info    120    node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info    121    node1/lrm: HA disarm requested, releasing agent lock and watchdog
+info    121    node1/lrm: status change active => wait_for_agent_lock
+info    123    node2/lrm: HA disarm requested, releasing agent lock and watchdog
+info    123    node2/lrm: status change active => wait_for_agent_lock
+info    140    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info    140    node1/crm: HA stack fully disarmed, releasing CRM watchdog
+info    420      cmdlist: execute crm node1 arm-ha
+info    420    node1/crm: got crm command: arm-ha
+info    420    node1/crm: re-arming HA stack
+info    440    node1/crm: service 'vm:101': state changed from 'freeze' to 'started'
+info    440    node1/crm: service 'vm:102': state changed from 'freeze' to 'started'
+info    441    node1/lrm: got lock 'ha_agent_node1_lock'
+info    441    node1/lrm: status change wait_for_agent_lock => active
+info    443    node2/lrm: got lock 'ha_agent_node2_lock'
+info    443    node2/lrm: status change wait_for_agent_lock => active
+info   1020     hardware: exit simulation - done
diff --git a/src/test/test-disarm-relocate1/manager_status b/src/test/test-disarm-relocate1/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-disarm-relocate1/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-disarm-relocate1/service_config b/src/test/test-disarm-relocate1/service_config
new file mode 100644
index 0000000..0336d09
--- /dev/null
+++ b/src/test/test-disarm-relocate1/service_config
@@ -0,0 +1,4 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" }
+}
-- 
2.47.3
^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
                   ` (2 preceding siblings ...)
  2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
@ 2026-03-21 23:42 ` Thomas Lamprecht
  2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
  2026-03-25 12:06 ` applied: " Thomas Lamprecht
  5 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-21 23:42 UTC (permalink / raw)
  To: pve-devel

Expose the disarm/arm mechanism as two separate POST endpoints under
/cluster/ha/status/ and wire them into the crm-command CLI namespace.

Extend the fencing status entry with disarming/disarmed states and
the active resource mode. Each LRM entry shows 'watchdog released'
once in disarm mode. The master and service status lines include the
disarm state when applicable.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

changes v1 -> v2:
  - mark both endpoints as protected, otherwise only CLI worked [Dominik]
  - show 'ignore' as service state during disarm-ignore mode for all
    services now that we preserve the service status in general when
    disarmed.

 src/PVE/API2/HA/Status.pm | 142 ++++++++++++++++++++++++++++++++------
 src/PVE/CLI/ha_manager.pm |   2 +
 2 files changed, 123 insertions(+), 21 deletions(-)

diff --git a/src/PVE/API2/HA/Status.pm b/src/PVE/API2/HA/Status.pm
index a6b00b9..59a93b6 100644
--- a/src/PVE/API2/HA/Status.pm
+++ b/src/PVE/API2/HA/Status.pm
@@ -52,7 +52,10 @@ __PACKAGE__->register_method({
         my ($param) = @_;
 
         my $result = [
-            { name => 'current' }, { name => 'manager_status' },
+            { name => 'current' },
+            { name => 'manager_status' },
+            { name => 'disarm-ha' },
+            { name => 'arm-ha' },
         ];
 
         return $result;
@@ -144,10 +147,17 @@ __PACKAGE__->register_method({
                     optional => 1,
                 },
                 'armed-state' => {
-                    description => "For type 'fencing'. Whether HA fencing is armed"
-                        . " or on standby.",
+                    description => "For type 'fencing'. Whether HA is armed, on standby,"
+                        . " disarming or disarmed.",
                     type => "string",
-                    enum => ['armed', 'standby'],
+                    enum => ['armed', 'standby', 'disarming', 'disarmed'],
+                    optional => 1,
+                },
+                resource_mode => {
+                    description =>
+                        "For type 'fencing'. How resources are handled while disarmed.",
+                    type => "string",
+                    enum => ['freeze', 'ignore'],
                     optional => 1,
                 },
             },
@@ -184,9 +194,13 @@ __PACKAGE__->register_method({
 
             my $extra_status = '';
 
+            if (my $disarm = $status->{disarm}) {
+                $extra_status .= " - $disarm->{state}, resource mode: $disarm->{mode}";
+            }
             my $datacenter_config = eval { cfs_read_file('datacenter.cfg') } // {};
             if (my $crs = $datacenter_config->{crs}) {
-                $extra_status = " - $crs->{ha} load CRS" if $crs->{ha} && $crs->{ha} ne 'basic';
+                $extra_status .= " - $crs->{ha} load CRS"
+                    if $crs->{ha} && $crs->{ha} ne 'basic';
             }
             my $time_str = localtime($status->{timestamp});
             my $status_text = "$master ($status_str, $time_str)$extra_status";
@@ -206,16 +220,32 @@ __PACKAGE__->register_method({
             && defined($status->{timestamp})
             && $timestamp_to_status->($ctime, $status->{timestamp}) eq 'active';
 
-        my $armed_state = $crm_active ? 'armed' : 'standby';
-        my $crm_wd = $crm_active ? "CRM watchdog active" : "CRM watchdog standby";
-        push @$res,
-            {
-                id => 'fencing',
-                type => 'fencing',
-                node => $status->{master_node} // $nodename,
-                status => "$armed_state ($crm_wd)",
-                'armed-state' => $armed_state,
-            };
+        if (my $disarm = $status->{disarm}) {
+            my $mode = $disarm->{mode} // 'unknown';
+            my $disarm_state = $disarm->{state} // 'unknown';
+            my $wd_released = $disarm_state eq 'disarmed';
+            my $crm_wd = $wd_released ? "CRM watchdog released" : "CRM watchdog active";
+            push @$res,
+                {
+                    id => 'fencing',
+                    type => 'fencing',
+                    node => $status->{master_node} // $nodename,
+                    status => "$disarm_state, resource mode: $mode ($crm_wd)",
+                    'armed-state' => $disarm_state,
+                    resource_mode => $mode,
+                };
+        } else {
+            my $armed_state = $crm_active ? 'armed' : 'standby';
+            my $crm_wd = $crm_active ? "CRM watchdog active" : "CRM watchdog standby";
+            push @$res,
+                {
+                    id => 'fencing',
+                    type => 'fencing',
+                    node => $status->{master_node} // $nodename,
+                    status => "$armed_state ($crm_wd)",
+                    'armed-state' => $armed_state,
+                };
+        }
 
         foreach my $node (sort keys %{ $status->{node_status} }) {
             my $active_count =
@@ -236,11 +266,17 @@ __PACKAGE__->register_method({
                 my $lrm_state = $lrm_status->{state} || 'unknown';
 
                 # LRM holds its watchdog while it has the agent lock
-                my $lrm_wd =
-                    ($status_str eq 'active'
-                        && ($lrm_state eq 'active' || $lrm_state eq 'maintenance'))
-                    ? 'watchdog active'
-                    : 'watchdog standby';
+                my $lrm_wd;
+                if (
+                    $status_str eq 'active'
+                    && ($lrm_state eq 'active' || $lrm_state eq 'maintenance')
+                ) {
+                    $lrm_wd = 'watchdog active';
+                } elsif ($lrm_mode && $lrm_mode eq 'disarm') {
+                    $lrm_wd = 'watchdog released';
+                } else {
+                    $lrm_wd = 'watchdog standby';
+                }
 
                 if ($status_str eq 'active') {
                     $lrm_mode ||= 'active';
@@ -253,7 +289,7 @@ __PACKAGE__->register_method({
                             $status_str = $lrm_state;
                         }
                     }
-                } elsif ($lrm_mode && $lrm_mode eq 'maintenance') {
+                } elsif ($lrm_mode && ($lrm_mode eq 'maintenance' || $lrm_mode eq 'disarm')) {
                     $status_str = "$lrm_mode mode";
                 }
 
@@ -284,6 +320,14 @@ __PACKAGE__->register_method({
             my $node = $data->{node} // '---'; # to be safe against manual tinkering
 
             $data->{state} = PVE::HA::Tools::get_verbose_service_state($ss, $sc);
+
+            # show disarm resource mode instead of internal service state
+            if (my $disarm = $status->{disarm}) {
+                if ($disarm->{mode} eq 'ignore') {
+                    $data->{state} = 'ignore';
+                }
+            }
+
             $data->{status} = "$sid ($node, $data->{state})"; # backward compat. and CLI
 
             # also return common resource attributes
@@ -348,4 +392,60 @@ __PACKAGE__->register_method({
     },
 });
 
+__PACKAGE__->register_method({
+    name => 'disarm-ha',
+    path => 'disarm-ha',
+    method => 'POST',
+    protected => 1,
+    description => "Request disarming the HA stack, releasing all watchdogs cluster-wide.",
+    permissions => {
+        check => ['perm', '/', ['Sys.Console']],
+    },
+    parameters => {
+        additionalProperties => 0,
+        properties => {
+            'resource-mode' => {
+                description => "Controls how HA managed resources are handled while disarmed."
+                    . " The current state of resources is not affected."
+                    . " 'freeze': new commands and state changes are not applied."
+                    . " 'ignore': resources are removed from HA tracking and can be"
+                    . " managed as if they were not HA managed.",
+                type => 'string',
+                enum => ['freeze', 'ignore'],
+            },
+        },
+    },
+    returns => { type => 'null' },
+    code => sub {
+        my ($param) = @_;
+
+        PVE::HA::Config::queue_crm_commands("disarm-ha $param->{'resource-mode'}");
+
+        return undef;
+    },
+});
+
+__PACKAGE__->register_method({
+    name => 'arm-ha',
+    path => 'arm-ha',
+    method => 'POST',
+    protected => 1,
+    description => "Request re-arming the HA stack after it was disarmed.",
+    permissions => {
+        check => ['perm', '/', ['Sys.Console']],
+    },
+    parameters => {
+        additionalProperties => 0,
+        properties => {},
+    },
+    returns => { type => 'null' },
+    code => sub {
+        my ($param) = @_;
+
+        PVE::HA::Config::queue_crm_commands("arm-ha");
+
+        return undef;
+    },
+});
+
 1;
diff --git a/src/PVE/CLI/ha_manager.pm b/src/PVE/CLI/ha_manager.pm
index be6978c..f257c01 100644
--- a/src/PVE/CLI/ha_manager.pm
+++ b/src/PVE/CLI/ha_manager.pm
@@ -298,6 +298,8 @@ our $cmddef = {
             enable => [__PACKAGE__, 'node-maintenance-set', ['node'], { disable => 0 }],
             disable => [__PACKAGE__, 'node-maintenance-set', ['node'], { disable => 1 }],
         },
+        'disarm-ha' => ['PVE::API2::HA::Status', 'disarm-ha', ['resource-mode']],
+        'arm-ha' => ['PVE::API2::HA::Status', 'arm-ha', []],
     },
 
 };
-- 
2.47.3






* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
@ 2026-03-23 13:04   ` Dominik Rusovac
  2026-03-25 15:50   ` Fiona Ebner
  2026-03-26 16:02   ` Daniel Kral
  2 siblings, 0 replies; 13+ messages in thread
From: Dominik Rusovac @ 2026-03-23 13:04 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

two negligible nits inline

On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:

[snip]

> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 301c391..f6857bd 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,15 @@ sub sim_hardware_cmd {
>                  || $action eq 'disable-node-maintenance'
>              ) {
>                  $self->queue_crm_commands_nolock("$action $node");
> +            } elsif ($action eq 'disarm-ha') {
> +                my $mode = $params[0];
> +
> +                die "sim_hardware_cmd: unknown resource mode '$mode'\n"
> +                    if $mode !~ m/^(freeze|ignore)$/;
> +
> +                $self->queue_crm_commands_nolock("disarm-ha $mode");

    $self->queue_crm_commands_nolock("$action $mode");

nit: following DRY principle, this would be cleaner imo

> +            } elsif ($action eq 'arm-ha') {
> +                $self->queue_crm_commands_nolock("arm-ha");

    $self->queue_crm_commands_nolock("$action");

nit: same here

[snip]





* Re: [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
                   ` (3 preceding siblings ...)
  2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
@ 2026-03-23 13:05 ` Dominik Rusovac
  2026-03-25 12:06 ` applied: " Thomas Lamprecht
  5 siblings, 0 replies; 13+ messages in thread
From: Dominik Rusovac @ 2026-03-23 13:05 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

Thx for coming up with a v2 so quickly! 

On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
> The biggest change compared to v1 is how ignore mode handles the service
> status: instead of clearing it entirely, the relevant parts of service
> status are now preserved across the disarm/arm cycle. This allows
> runtime state like maintenance_node to survive, so services correctly
> migrate back to their original node after maintenance ends, even if the
> disarm happened while maintenance was active. Thanks @Dominik R. for
> noticing this.

Alongside repeating the tests conducted for v1, which again were OK, 
I also conducted tests resembling test-disarm-maintenance{1,2,3}
in a real cluster, which were also OK.

>
> To keep the preserved state clean, stale runtime data (failed_nodes,
> cmd, target, ...) is pruned from service entries on disarm - both in
> freeze and ignore mode - so the state machine starts fresh on re-arm.
> The status API overrides the displayed service state to 'ignore' during
> disarm-ignore mode, while the internal state stays untouched for
> seamless resume.
>
> On arm-ha from ignore mode, the CRM now rechecks the previous resource's
> node against the resource service config, picking up any manual
> migrations the admin performed while HA tracking was suspended.
>
> First patch 1/4 is new and adds a manual-migrate simulator command as a
> preparatory patch, since it is independently useful for testing the
> per-service 'ignored' state handling.
>
> Previous discussion and v1:
> https://lore.proxmox.com/pve-devel/20260309220128.973793-1-t.lamprecht@proxmox.com/
>

[snip]

lgtm, consider this series:

Reviewed-by: Dominik Rusovac <d.rusovac@proxmox.com>
Tested-by: Dominik Rusovac <d.rusovac@proxmox.com>





* applied: [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance
  2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
                   ` (4 preceding siblings ...)
  2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
@ 2026-03-25 12:06 ` Thomas Lamprecht
  5 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-25 12:06 UTC (permalink / raw)
  To: pve-devel, Thomas Lamprecht

On Sun, 22 Mar 2026 00:42:49 +0100, Thomas Lamprecht wrote:
> The biggest change compared to v1 is how ignore mode handles the service
> status: instead of clearing it entirely, the relevant parts of service
> status are now preserved across the disarm/arm cycle. This allows
> runtime state like maintenance_node to survive, so services correctly
> migrate back to their original node after maintenance ends, even if the
> disarm happened while maintenance was active. Thanks @Dominik R. for
> noticing this.
> 
> [...]

Applied, with the DRY style issue that Dominik spotted squashed in (thanks!)

[1/4] sim: hardware: add manual-migrate command for ignored services
      commit: be22651cf9756d2e02bca428ab708fc3fedcaa9b
[2/4] api: status: add fencing status entry with armed/standby state
      commit: 2d143acf0289badaa60f5222af0f260f0efc55ac
[3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
      commit: d9714e7047dda84b5bd901d39471944ef1d4fe98
[4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring
      commit: 3f2e82674b86de92e4526048c1356aec54e527f5





* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
  2026-03-23 13:04   ` Dominik Rusovac
@ 2026-03-25 15:50   ` Fiona Ebner
  2026-03-27  1:17     ` Thomas Lamprecht
  2026-03-26 16:02   ` Daniel Kral
  2 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2026-03-25 15:50 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On 22.03.26 at 12:57 AM, Thomas Lamprecht wrote:
> +    if ($mode eq 'freeze') {
> +        for my $sid (sort keys %$ss) {
> +            my $sd = $ss->{$sid};
> +            my $state = $sd->{state};
> +            next if $state eq 'freeze'; # already frozen
> +            if (
> +                $state eq 'started'
> +                || $state eq 'stopped'
> +                || $state eq 'request_stop'
> +                || $state eq 'request_start'
> +                || $state eq 'request_start_balance'
> +                || $state eq 'error'

Should it really happen for the 'error' state too? Because when
re-arming, the state will become 'started':

Mar 25 16:20:06 pve9a1 pve-ha-crm[242553]: disarm: freezing service
'vm:400' (was 'error')
...
Mar 25 16:20:36 pve9a1 pve-ha-crm[242553]: service 'vm:400': state
changed from 'freeze' to 'started'

Which feels rather surprising to me. For comparison, after a cold
cluster start, services in 'error' state are not (attempted to be)
started either.

> +            ) {
> +                $haenv->log('info', "disarm: freezing service '$sid' (was '$state')");
> +                delete $sd->{$_} for grep { !$keep_keys{$_} } keys %$sd;
> +                $sd->{state} = 'freeze';
> +                $sd->{uid} = compute_new_uuid('freeze');
> +            }
> +        }
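
For illustration, the freeze condition in the quoted hunk can be seen as a
membership test on the service state; dropping 'error' from it, as suggested,
is then a one-entry change. A rough sketch with illustrative names, not the
actual ha-manager code:

```perl
use strict;
use warnings;

# Illustrative set of service states that would be frozen on disarm;
# mirrors the quoted hunk, but the names here are ours, not ha-manager's.
# 'error' is deliberately absent, so an errored service would keep its
# state across a disarm/arm cycle instead of resuming as 'started'.
my %freezable = map { $_ => 1 } qw(
    started stopped request_stop request_start request_start_balance
);

sub should_freeze {
    my ($state) = @_;
    return 0 if $state eq 'freeze'; # already frozen
    return $freezable{$state} ? 1 : 0;
}

print should_freeze('started'), "\n"; # 1
print should_freeze('error'), "\n"; # 0
```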





* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
  2026-03-23 13:04   ` Dominik Rusovac
  2026-03-25 15:50   ` Fiona Ebner
@ 2026-03-26 16:02   ` Daniel Kral
  2026-03-26 23:15     ` Thomas Lamprecht
  2 siblings, 1 reply; 13+ messages in thread
From: Daniel Kral @ 2026-03-26 16:02 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

Nice work!

While rebasing the load balancing patches on this, I had a few thoughts
about the transient states, which I wanted to add below. Nothing really
urgent, and I might have missed something, so correct me if I'm wrong.


One small UX nit:

As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
command takes priority over the migrate/relocate command. It is more of
an edge case anyway that both a migration and a disarm-ha command are
issued in the same HA Manager cycle, but it could of course still happen.

Handling the disarming as an admin task is more important than a
user-initiated migration request. It would still be great if this
action were relayed back to the user for the migration request (who
might only see the hamigrate task but no following qmigrate task), but
this is more of a UX thing and should be handled e.g. as part of #6220 [0]
so users can directly follow the HA request and the further actions.

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220

On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index b1dbe6a..aa29858 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>      }
>  }
>  
> +sub handle_disarm {
> +    my ($self, $disarm, $ss, $lrm_modes) = @_;
> +
> +    my $haenv = $self->{haenv};
> +    my $ns = $self->{ns};
> +
> +    # defer disarm if any services are in a transient state that needs the state machine to resolve
> +    for my $sid (sort keys %$ss) {
> +        my $state = $ss->{$sid}->{state};
> +        if ($state eq 'fence' || $state eq 'recovery') {
> +            $haenv->log(
> +                'warn', "deferring disarm - service '$sid' is in '$state' state",
> +            );
> +            return 0; # let manage() continue so fence/recovery can progress
> +        }
> +        if ($state eq 'migrate' || $state eq 'relocate') {
> +            $haenv->log(
> +                'info', "deferring disarm - service '$sid' is in '$state' state",
> +            );
> +            return 0; # let manage() continue so migration can complete
> +        }
> +    }

Here the HA disarming is deferred so that the HA Manager can continue
with processing the HA resources FSM if at least one of the HA resources
is in one of the 4 transient states.

This can have side effects for other HA resources as well: ones that
currently aren't in one of these 4 transient states, but are implicitly
in a transient state.


For example:

There are two HA resources in the current service state.

{
   ...
   "service_status": {
       "vm:101": {
           "state": "started",
           "node": "node1",
           "cmd": [ "migrate", "node2" ],
           "uid": "...",
       },
       "vm:102": {
           "state": "...",
           "node": "node1",
           "uid": "...",
       },
   }
}


If vm:102 is in state 'started', then the HA Manager will start to
disarm the HA stack right away and drop vm:101's migration request in
both disarm modes.

If vm:102 is in any of the 4 transient states from above though, then
the HA Manager will defer the disarm and process all HA resources, which
will cause 'vm:101' to be put in 'migrate' now.


So the transient state of one HA resource might cause other HA resources
to be put in a transient state as well, even though I would have
expected the deferring here to only resolve the existing transient
states.

This also means that the HA Manager will handle user requests for HA
resources in the same HA Manager cycle as a disarm request if there is
at least one HA resource in one of the 4 transient states. This
contradicts the rule that disarming takes priority over the
migrate/relocate request.

Two other possible cases of 'implicitly transient' states might be:

- adding a HA rule, which makes a HA resource in 'started' state be put
  in 'migrate' to another node when processing the select_service_node()
  in next_state_started().

- the node of a HA resource is offline delayed in the same round as the
  disarm request. If none of the HA resources are in a transient state
  yet, the disarm request goes through, otherwise the affected HA
  resources might be put in 'fence'.


I haven't thought this through fully, but an option might be to only
allow FSM processing for the HA resources that are in one of these 4
transient states and not process the others.

E.g. break the FSM transition loop out into its own function: in normal
operation we iterate through all services in $ss, but for a deferred
disarm we only iterate through the HA resources in transient states,
which should be resolved.
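
A rough sketch of that split (hypothetical helper name and shapes, not the
actual Manager.pm code) could look like:

```perl
use strict;
use warnings;

# the 4 transient states the deferred disarm waits on
my %transient = map { $_ => 1 } qw(fence recovery migrate relocate);

# Hypothetical helper: which service IDs the FSM loop should visit this
# cycle. During a deferred disarm only services already in a transient
# state are driven, so no new migrations or fences get introduced.
sub sids_to_process {
    my ($ss, $disarm_deferred) = @_;
    my @sids = sort keys %$ss;
    return @sids if !$disarm_deferred;
    return grep { $transient{ $ss->{$_}->{state} // '' } } @sids;
}

my $ss = {
    'vm:101' => { state => 'started', cmd => ['migrate', 'node2'] },
    'vm:102' => { state => 'migrate' },
};

print join(',', sids_to_process($ss, 1)), "\n"; # vm:102
print join(',', sids_to_process($ss, 0)), "\n"; # vm:101,vm:102
```

With this, vm:101's pending migrate command would stay untouched while the
deferred disarm waits for vm:102 to finish its migration.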


> +
> +    my $mode = $disarm->{mode};
> +
> +    # prune stale runtime data (failed_nodes, cmd, target, ...) so the state machine starts
> +    # fresh on re-arm; preserve maintenance_node for correct return behavior
> +    my %keep_keys = map { $_ => 1 } qw(state node uid maintenance_node);
> +
> +    if ($mode eq 'freeze') {
> +        for my $sid (sort keys %$ss) {
> +            my $sd = $ss->{$sid};
> +            my $state = $sd->{state};
> +            next if $state eq 'freeze'; # already frozen
> +            if (
> +                $state eq 'started'
> +                || $state eq 'stopped'
> +                || $state eq 'request_stop'
> +                || $state eq 'request_start'
> +                || $state eq 'request_start_balance'
> +                || $state eq 'error'
> +            ) {
> +                $haenv->log('info', "disarm: freezing service '$sid' (was '$state')");
> +                delete $sd->{$_} for grep { !$keep_keys{$_} } keys %$sd;
> +                $sd->{state} = 'freeze';
> +                $sd->{uid} = compute_new_uuid('freeze');
> +            }
> +        }
> +    } elsif ($mode eq 'ignore') {
> +        # keep $ss intact; the disarm flag in $ms causes service loops and vm_is_ha_managed()
> +        # to skip these services while disarmed
> +        for my $sid (sort keys %$ss) {
> +            my $sd = $ss->{$sid};
> +            delete $sd->{$_} for grep { !$keep_keys{$_} } keys %$sd;
> +        }
> +    }
> +
> +    # check if all online LRMs have entered disarm mode
> +    my $all_disarmed = 1;
> +    my $online_nodes = $ns->list_online_nodes();
> +
> +    for my $node (@$online_nodes) {
> +        my $lrm_mode = $lrm_modes->{$node} // 'unknown';
> +        if ($lrm_mode ne 'disarm') {
> +            $all_disarmed = 0;
> +            last;
> +        }
> +    }
> +
> +    if ($all_disarmed && $disarm->{state} ne 'disarmed') {
> +        $haenv->log('info', "all LRMs disarmed, HA stack is now fully disarmed");
> +        $disarm->{state} = 'disarmed';
> +    }
> +
> +    # once disarmed, stay disarmed - a returning node's LRM will catch up within one cycle
> +    $self->{all_lrms_disarmed} = $disarm->{state} eq 'disarmed';
> +
> +    $self->flush_master_status();
> +
> +    return 1;
> +}
> +
> +sub is_fully_disarmed {
> +    my ($self) = @_;
> +
> +    return $self->{all_lrms_disarmed};
> +}
> +
>  sub manage {
>      my ($self) = @_;
>  
> @@ -713,6 +857,15 @@ sub manage {
>  
>      $self->update_crm_commands();
>  
> +    if (my $disarm = $ms->{disarm}) {
> +        if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
> +            return; # disarm active and progressing, skip normal service state machine
> +        }
> +        # disarm deferred (e.g. due to active fencing) - fall through to let it complete
> +    }

Regarding the load balancing patches, I would change this to

- }
+ } else {
> +     # load balance only if no disarm is active, as during a deferred disarm
+     # the HA Manager should not introduce any new migrations
+     $self->load_balance();
+ }

there.

> +
> +    $self->{all_lrms_disarmed} = 0;
> +
>      for (;;) {
>          my $repeat = 0;





* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-26 16:02   ` Daniel Kral
@ 2026-03-26 23:15     ` Thomas Lamprecht
  2026-03-27 10:21       ` Daniel Kral
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-26 23:15 UTC (permalink / raw)
  To: Daniel Kral, pve-devel

On 26.03.26 at 17:02, Daniel Kral wrote:
> Nice work!
> 
> While rebasing the load balancing patches on this, I had a few thoughts
> about the transient states, which I wanted to add below. Nothing really
> urgent and I might have missed something so correct me if I'm wrong.
> 
> 
> One small UX nit:
> 
> As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
> command takes priority over the migrate/relocate command. It is more of
> an edge case anyway that both a migration and a disarm-ha command are
> issued in the same HA Manager cycle, but it could of course still happen.
> 
> Handling the disarming as an admin task is more important than a
> user-initiated migration request. It would still be great if this
> action were relayed back to the user for the migration request (who
> might only see the hamigrate task but no following qmigrate task), but
> this is more of a UX thing and should be handled e.g. as part of #6220 [0]
> so users can directly follow the HA request and the further actions.

yeah, UX could be slightly better, but as you hint, it's really a
pre-existing issue for a few cases already, so slightly orthogonal
to this.

> 
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220
> 
> On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index b1dbe6a..aa29858 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>>      }
>>  }
>>  
>> +sub handle_disarm {
>> +    my ($self, $disarm, $ss, $lrm_modes) = @_;
>> +
>> +    my $haenv = $self->{haenv};
>> +    my $ns = $self->{ns};
>> +
>> +    # defer disarm if any services are in a transient state that needs the state machine to resolve
>> +    for my $sid (sort keys %$ss) {
>> +        my $state = $ss->{$sid}->{state};
>> +        if ($state eq 'fence' || $state eq 'recovery') {
>> +            $haenv->log(
>> +                'warn', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so fence/recovery can progress
>> +        }
>> +        if ($state eq 'migrate' || $state eq 'relocate') {
>> +            $haenv->log(
>> +                'info', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so migration can complete
>> +        }
>> +    }
> 
> Here the HA disarming is deferred so that the HA Manager can continue
> with processing the HA resources FSM if at least one of the HA resources
> is in one of the 4 transient states.
> 
> This can have side-effects for other HA resources as well, which
> currently aren't in one of these 4 transient states, but are implicitly
> in a transient state.
> 
> 
> For example:
> 
> There are two HA resources in the current service state.
> 
> {
>    ...
>    "service_status": {
>        "vm:101": {
>            "state": "started",
>            "node": "node1",
>            "cmd": [ "migrate", "node2" ],
>            "uid": "...",
>        },
>        "vm:102": {
>            "state": "...",
>            "node": "node1",
>            "uid": "...",
>        },
>    }
> }
> 
> 
> If vm:102 is in state 'started', then the HA Manager will start to
> disarm the HA stack right away and drop vm:101's migration request in
> both disarm modes.
> 
> If vm:102 is in any of the 4 transient states from above though, then
> the HA Manager will defer the disarm and process all HA resources, which
> will cause 'vm:101' to be put in 'migrate' now.
> 
> 
> So the transient state of one HA resource might cause other HA resources
> to be put in a transient state as well, even though I would have
> expected the deferring here to only resolve the existing transient
> states.
> 
> This also means that the HA Manager will handle user requests for HA
> resources in the same HA Manager cycle as a disarm request if there is
> at least one HA resource in one of the 4 transient states. This
> contradicts that disarming takes priority over the migrate/relocate
> request.

good find!

> Two other possible cases of 'implicitly transient' states might be:
> 
> - adding a HA rule, which makes a HA resource in 'started' state be put
>   in 'migrate' to another node when processing the select_service_node()
>   in next_state_started().
> 
> - the node of a HA resource is offline delayed in the same round as the
>   disarm request. If none of the HA resources are in a transient state
>   yet, the disarm request goes through, otherwise the affected HA
>   resources might be put in 'fence'.
> 
> 
> I haven't thought this through fully, but an option might be that we
> only allow the FSM processing of the HA resources, which are in one of
> these 4 transient states and don't process the others.
> 
> E.g. breaking out the FSM transition loop into its own function and in
> normal operation we iterate through all services in $ss, but for
> deferred disarming we only iterate through the HA resources in transient
> states, which should be resolved.

I pushed a follow-up [0] that should deal with this; another look at
that would be appreciated!
FWIW, I mostly pushed directly as I still wanted to do a bump+test upload
today: if everything is good we get a small regression fix out to users
faster; if not, we can follow up here and no harm done.

[0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21

[...]

>> +    if (my $disarm = $ms->{disarm}) {
>> +        if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
>> +            return; # disarm active and progressing, skip normal service state machine
>> +        }
>> +        # disarm deferred (e.g. due to active fencing) - fall through to let it complete
>> +    }
> 
> Regarding the load balancing patches, I would change this to
> 
> - }
> + } else {
> +     # load balance only if disarm is disabled as during a deferred disarm
> +     # the HA Manager should not introduce any new migrations
> +     $self->load_balance();
> + }
> 
> there.

Yeah, that's also what I did to resolve the merge conflict when applying
your series for review.

>> +
>> +    $self->{all_lrms_disarmed} = 0;
>> +
>>      for (;;) {
>>          my $repeat = 0;
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-25 15:50   ` Fiona Ebner
@ 2026-03-27  1:17     ` Thomas Lamprecht
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2026-03-27  1:17 UTC (permalink / raw)
  To: Fiona Ebner, pve-devel

On 25.03.26 at 16:49, Fiona Ebner wrote:
> On 22.03.26 at 12:57 AM, Thomas Lamprecht wrote:
>> +    if ($mode eq 'freeze') {
>> +        for my $sid (sort keys %$ss) {
>> +            my $sd = $ss->{$sid};
>> +            my $state = $sd->{state};
>> +            next if $state eq 'freeze'; # already frozen
>> +            if (
>> +                $state eq 'started'
>> +                || $state eq 'stopped'
>> +                || $state eq 'request_stop'
>> +                || $state eq 'request_start'
>> +                || $state eq 'request_start_balance'
>> +                || $state eq 'error'
> Should it really happen for the 'error' state too? Because when
> re-arming, the state will become 'started':
> 
> Mar 25 16:20:06 pve9a1 pve-ha-crm[242553]: disarm: freezing service
> 'vm:400' (was 'error')
> ...
> Mar 25 16:20:36 pve9a1 pve-ha-crm[242553]: service 'vm:400': state
> changed from 'freeze' to 'started'
> 
> Which feels rather surprising to me. For comparison, after a cold
> cluster start, services in 'error' state are not (attempted to be)
> started either.

You're right, thanks for noticing this, fixed in:
https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=890673383c27a2949315a306218ef035042b46b0;hp=b6b025a268032ff5302bede1f5eb56247af13f21

Albeit, I'd like to revisit the error state being a final status in the
FSM, as some things might often be better retried forever with a rate
limit in the context of HA. Anyhow, that's definitely orthogonal to this
series.
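The gist of the fix is simply to drop 'error' from the set of states that
get frozen on disarm, so an errored service does not come back as
'started' on re-arm. A minimal sketch of that filter, with a hypothetical
`freezable` helper name (not the actual commit):

```perl
use strict;
use warnings;

# States a service may be frozen from on disarm; 'error' is deliberately
# excluded so errored services stay in 'error' across a disarm/arm cycle.
sub freezable {
    my ($state) = @_;
    my %ok = map { $_ => 1 } qw(
        started stopped request_stop request_start request_start_balance
    );
    return $ok{$state} // 0;
}

print freezable('started') ? "freeze\n" : "skip\n"; # freeze
print freezable('error')   ? "freeze\n" : "skip\n"; # skip
```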

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
  2026-03-26 23:15     ` Thomas Lamprecht
@ 2026-03-27 10:21       ` Daniel Kral
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Kral @ 2026-03-27 10:21 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On Fri Mar 27, 2026 at 12:15 AM CET, Thomas Lamprecht wrote:
> On 26.03.26 at 17:02, Daniel Kral wrote:
>> Two other possible cases of 'implicitly transient' states might be:
>> 
>> - adding a HA rule, which makes a HA resource in 'started' state be put
>>   in 'migrate' to another node when processing the select_service_node()
>>   in next_state_started().
>> 
>> - the node of a HA resource is offline delayed in the same round as the
>>   disarm request. If none of the HA resources are in a transient state
>>   yet, the disarm request goes through, otherwise the affected HA
>>   resources might be put in 'fence'.
>> 
>> 
>> I haven't thought this through fully, but an option might be that we
>> only allow the FSM processing of the HA resources, which are in one of
>> these 4 transient states and don't process the others.
>> 
>> E.g. breaking out the FSM transition loop into its own function and in
>> normal operation we iterate through all services in $ss, but for
>> deferred disarming we only iterate through the HA resources in transient
>> states, which should be resolved.
>
> I pushed a follow-up [0] that should deal with this, another look at
> that would be appreciated!
> FWIW, I pushed directly mostly because I still wanted to do a bump + test
> upload today: if everything is good we get a small regression fix out to
> users faster, and if not we can follow up here with no harm done.
>
> [0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21
>
> [...]
>

Thanks for the quick patch! The changes look good to me too and the test
case captures the behavior well!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-27 10:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-23 13:04   ` Dominik Rusovac
2026-03-25 15:50   ` Fiona Ebner
2026-03-27  1:17     ` Thomas Lamprecht
2026-03-26 16:02   ` Daniel Kral
2026-03-26 23:15     ` Thomas Lamprecht
2026-03-27 10:21       ` Daniel Kral
2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
2026-03-25 12:06 ` applied: " Thomas Lamprecht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal