public inbox for pve-devel@lists.proxmox.com
* [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
@ 2026-05-19 14:38 Daniel Kral
  2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
  To: pve-devel

As in patch message #2:

If HA resources are in transient states that defer the disarming
process, but their LRMs are already idle and in disarm mode, these LRMs
will not resolve the transient states of these HA resources, contrary to
what the HA Manager assumes.

For HA resources that are still moving, the HA Manager gets stuck in a
loop: it keeps deferring the disarming process to wait for an LRM
response for these moving HA resources, which will never come because
the LRM is idle.

Therefore, allow the LRM to become active in disarm mode if any HA
resources on the LRM's node are in one of these transient states, and
make sure that the LRM only processes the disarm-deferring HA resources
while it is active in disarm mode.
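
The decision described above can be illustrated with a small standalone
sketch. It is not the patch itself: get_disarm_deferred_services() is
modelled on the helper added to PVE::HA::Tools in patch 2, while
stays_idle_for_disarm() is a hypothetical name used only here, and the
sample data is a reduced copy of the manager_status from the new test
cases (uids omitted).

```perl
use strict;
use warnings;

# Service states that defer disarming (the same list as in patch 2).
my @DEFERRING_STATES = qw(fence recovery migrate relocate);

# Standalone model of the helper added to PVE::HA::Tools: collect the SIDs
# that are in a disarm-deferring state, optionally restricted to the
# services currently located on $node.
sub get_disarm_deferred_services {
    my ($ss, $node) = @_;

    my $deferred_sids = {};
    for my $sid (keys %$ss) {
        my ($state, $current_node) = @{ $ss->{$sid} }{qw(state node)};
        next if $node && (!$current_node || $current_node ne $node);
        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @DEFERRING_STATES;
    }
    return $deferred_sids;
}

# Hypothetical helper (not part of the patch): returns 1 if an LRM in
# disarm mode may stay idle, i.e. no service on its node defers the disarm.
sub stays_idle_for_disarm {
    my ($mode, $ss, $node) = @_;

    return 0 if $mode ne 'disarm';
    my $deferred_sids = get_disarm_deferred_services($ss, $node);
    return scalar(keys %$deferred_sids) ? 0 : 1;
}

# Reduced service status, following test-disarm-idle-lrm1/manager_status.
my $ss = {
    'vm:101' => { node => 'node1', state => 'started' },
    'vm:102' => { node => 'node2', state => 'fence' },
    'vm:103' => { node => 'node3', state => 'migrate', target => 'node2' },
};

for my $node (qw(node1 node2 node3)) {
    my @sids = sort keys %{ get_disarm_deferred_services($ss, $node) };
    printf "%s: deferred=[%s] stay_idle=%d\n",
        $node, join(',', @sids), stays_idle_for_disarm('disarm', $ss, $node);
}
```

With this input, only node1 is reported as free to stay idle. The real
LRM applies further conditions before it becomes active (no fence
request, a non-zero active service count, quorum), which this sketch
leaves out.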


Daniel Kral (2):
  test: add disarm test cases for idle lrms with transient ha resources
  make idle LRMs resolve leftover moving HA resources while disarmed

 src/PVE/HA/LRM.pm                             | 19 ++++++++-
 src/PVE/HA/Manager.pm                         |  8 ++--
 src/PVE/HA/Tools.pm                           | 17 ++++++++
 src/test/test-disarm-idle-lrm1/README         |  8 ++++
 src/test/test-disarm-idle-lrm1/cmdlist        |  3 ++
 .../test-disarm-idle-lrm1/hardware_status     |  5 +++
 src/test/test-disarm-idle-lrm1/log.expect     | 40 +++++++++++++++++++
 src/test/test-disarm-idle-lrm1/manager_status | 26 ++++++++++++
 src/test/test-disarm-idle-lrm1/service_config |  5 +++
 src/test/test-disarm-idle-lrm2/README         |  8 ++++
 src/test/test-disarm-idle-lrm2/cmdlist        |  3 ++
 .../test-disarm-idle-lrm2/hardware_status     |  5 +++
 src/test/test-disarm-idle-lrm2/log.expect     | 39 ++++++++++++++++++
 src/test/test-disarm-idle-lrm2/manager_status | 26 ++++++++++++
 src/test/test-disarm-idle-lrm2/service_config |  5 +++
 15 files changed, 211 insertions(+), 6 deletions(-)
 create mode 100644 src/test/test-disarm-idle-lrm1/README
 create mode 100644 src/test/test-disarm-idle-lrm1/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm1/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm1/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm1/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm1/service_config
 create mode 100644 src/test/test-disarm-idle-lrm2/README
 create mode 100644 src/test/test-disarm-idle-lrm2/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm2/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm2/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm2/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm2/service_config

-- 
2.47.3






* [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources
  2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 14:38 ` Daniel Kral
  2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
  To: pve-devel

These test cases document how the HA stack currently behaves if it is
disarmed while there are HA resources in disarm-deferring states (fence,
recovery, migrate, relocate) and their responsible LRMs are already
idle, which makes them unresponsive to the HA Manager while they are in
disarm mode.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/test/test-disarm-idle-lrm1/README         |  8 +++
 src/test/test-disarm-idle-lrm1/cmdlist        |  3 +
 .../test-disarm-idle-lrm1/hardware_status     |  5 ++
 src/test/test-disarm-idle-lrm1/log.expect     | 59 +++++++++++++++++++
 src/test/test-disarm-idle-lrm1/manager_status | 26 ++++++++
 src/test/test-disarm-idle-lrm1/service_config |  5 ++
 src/test/test-disarm-idle-lrm2/README         |  8 +++
 src/test/test-disarm-idle-lrm2/cmdlist        |  3 +
 .../test-disarm-idle-lrm2/hardware_status     |  5 ++
 src/test/test-disarm-idle-lrm2/log.expect     | 56 ++++++++++++++++++
 src/test/test-disarm-idle-lrm2/manager_status | 26 ++++++++
 src/test/test-disarm-idle-lrm2/service_config |  5 ++
 12 files changed, 209 insertions(+)
 create mode 100644 src/test/test-disarm-idle-lrm1/README
 create mode 100644 src/test/test-disarm-idle-lrm1/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm1/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm1/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm1/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm1/service_config
 create mode 100644 src/test/test-disarm-idle-lrm2/README
 create mode 100644 src/test/test-disarm-idle-lrm2/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm2/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm2/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm2/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm2/service_config

diff --git a/src/test/test-disarm-idle-lrm1/README b/src/test/test-disarm-idle-lrm1/README
new file mode 100644
index 00000000..6d5124cd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'ignore' mode (keep HA resources in their previous
+state) while there is an HA resource in a transient state on a node, whose LRM
+is idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. Though as the LRM for the moving HA resource is idle, the HA
+Manager doesn't get any LRM response for the HA resource, for which the
+disarming is deferred, and therefore the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm1/cmdlist b/src/test/test-disarm-idle-lrm1/cmdlist
new file mode 100644
index 00000000..a29cf8e3
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/cmdlist
@@ -0,0 +1,3 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha ignore" ]
+]
diff --git a/src/test/test-disarm-idle-lrm1/hardware_status b/src/test/test-disarm-idle-lrm1/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm1/log.expect b/src/test/test-disarm-idle-lrm1/log.expect
new file mode 100644
index 00000000..1b7f4ece
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/log.expect
@@ -0,0 +1,59 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute crm node1 disarm-ha ignore
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: got crm command: disarm-ha ignore
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+warn     20    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info     20    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     20    node1/crm: got lock 'ha_agent_node2_lock'
+info     20    node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai     20    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm1/manager_status b/src/test/test-disarm-idle-lrm1/manager_status
new file mode 100644
index 00000000..ba7448a0
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/manager_status
@@ -0,0 +1,26 @@
+{
+    "master_node": "node1",
+    "node_status": {
+	"node1":"online",
+	"node2":"fence",
+	"node3":"online"
+    },
+    "service_status": {
+	"vm:101": {
+	    "node": "node1",
+	    "state": "started",
+	    "uid": "lavE3c7vLnotUBGT9whswg"
+	},
+	"vm:102": {
+	    "node": "node2",
+	    "state": "fence",
+	    "uid": "lavE3c7vLnotUBGT9whswh"
+	},
+	"vm:103": {
+	    "node": "node3",
+	    "state": "migrate",
+	    "target": "node2",
+	    "uid": "lavE3c7vLnotUBGT9whswj"
+	}
+    }
+}
diff --git a/src/test/test-disarm-idle-lrm1/service_config b/src/test/test-disarm-idle-lrm1/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/README b/src/test/test-disarm-idle-lrm2/README
new file mode 100644
index 00000000..d1731578
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'freeze' mode (keep HA resources frozen while disarmed)
+while there is an HA resource in a transient state on a node, whose LRM is
+idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. Though as the LRM for the moving HA resource is idle, the HA
+Manager doesn't get any LRM response for the HA resource, for which the
+disarming is deferred, and therefore the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm2/cmdlist b/src/test/test-disarm-idle-lrm2/cmdlist
new file mode 100644
index 00000000..5a46a662
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/cmdlist
@@ -0,0 +1,3 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha freeze" ]
+]
diff --git a/src/test/test-disarm-idle-lrm2/hardware_status b/src/test/test-disarm-idle-lrm2/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/log.expect b/src/test/test-disarm-idle-lrm2/log.expect
new file mode 100644
index 00000000..d0ba96ff
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/log.expect
@@ -0,0 +1,56 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute crm node1 disarm-ha freeze
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: got crm command: disarm-ha freeze
+warn     20    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info     20    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     20    node1/crm: got lock 'ha_agent_node2_lock'
+info     20    node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai     20    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm2/manager_status b/src/test/test-disarm-idle-lrm2/manager_status
new file mode 100644
index 00000000..c28f4ffd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/manager_status
@@ -0,0 +1,26 @@
+{
+    "master_node": "node1",
+    "node_status": {
+	"node1":"online",
+	"node2":"fence",
+	"node3":"online"
+    },
+    "service_status": {
+	"vm:101": {
+	    "node": "node1",
+	    "state": "online",
+	    "uid": "lavE3c7vLnotUBGT9whswg"
+	},
+	"vm:102": {
+	    "node": "node2",
+	    "state": "fence",
+	    "uid": "lavE3c7vLnotUBGT9whswh"
+	},
+	"vm:103": {
+	    "node": "node3",
+	    "state": "migrate",
+	    "target": "node2",
+	    "uid": "lavE3c7vLnotUBGT9whswj"
+	}
+    }
+}
diff --git a/src/test/test-disarm-idle-lrm2/service_config b/src/test/test-disarm-idle-lrm2/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
-- 
2.47.3






* [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
  2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
@ 2026-05-19 14:38 ` Daniel Kral
  2026-05-19 16:00   ` Fiona Ebner
  2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
  2026-05-19 20:11 ` applied: " Thomas Lamprecht
  3 siblings, 1 reply; 10+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
  To: pve-devel

If HA resources are in transient states that defer the disarming
process, but their LRMs are already idle and in disarm mode, these LRMs
will not resolve the transient states of these HA resources, contrary to
what the HA Manager assumes.

For HA resources that are still moving, the HA Manager gets stuck in a
loop: it keeps deferring the disarming process to wait for an LRM
response for these moving HA resources, which will never come because
the LRM is idle.

Therefore, allow the LRM to become active in disarm mode if any HA
resources on the LRM's node are in one of these transient states, and
make sure that the LRM only processes the disarm-deferring HA resources
while it is active in disarm mode.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/HA/LRM.pm                         | 19 ++++++++++-
 src/PVE/HA/Manager.pm                     |  8 ++---
 src/PVE/HA/Tools.pm                       | 17 ++++++++++
 src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
 src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
 5 files changed, 58 insertions(+), 62 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 426982cc..9100d611 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -312,6 +312,18 @@ sub active_service_count {
     return PVE::HA::Tools::count_active_services($ss, $nodename);
 }
 
+# returns a truthy value if there are HA resources in transient states, which
+# need to be resolved, e.g. to complete the disarm procedure.
+sub has_disarm_deferred_services {
+    my ($self) = @_;
+
+    my $ss = $self->{service_status};
+    my $nodename = $self->{haenv}->nodename();
+    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename);
+
+    return %$deferred_sids;
+}
+
 my $wrote_lrm_status_at_startup = 0;
 
 sub do_one_iteration {
@@ -371,7 +383,7 @@ sub work {
 
         my $service_count = $self->active_service_count();
 
-        if ($self->{mode} eq 'disarm') {
+        if ($self->{mode} eq 'disarm' && !$self->has_disarm_deferred_services()) {
             # stay idle while disarmed, don't acquire lock
         } elsif (!$fence_request && $service_count && $haenv->quorate()) {
             if ($self->get_protected_ha_agent_lock()) {
@@ -709,12 +721,17 @@ sub manage_resources {
     my $nodename = $haenv->nodename();
 
     my $ss = $self->{service_status};
+    my $deferred_sids;
+    $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename)
+        if $self->{mode} eq 'disarm';
 
     foreach my $sid (keys %{ $self->{restart_tries} }) {
         delete $self->{restart_tries}->{$sid} if !$ss->{$sid};
     }
 
     foreach my $sid (keys %$ss) {
+        next if $deferred_sids && !$deferred_sids->{$sid};
+
         my $sd = $ss->{$sid};
         next if !$sd->{node} || !$sd->{uid};
         next if $sd->{node} ne $nodename;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 9b901c4f..a2baf349 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -929,15 +929,13 @@ sub handle_disarm {
     }
 
     # defer disarm if any services are in a transient state that needs the state machine to resolve
-    my $deferred_sids = {};
-    for my $sid (sort keys %$ss) {
+    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss);
+    for my $sid (sort keys %$deferred_sids) {
         my $state = $ss->{$sid}->{state};
         if ($state eq 'fence' || $state eq 'recovery') {
             $haenv->log('warning', "deferring disarm - service '$sid' is in '$state' state");
-            $deferred_sids->{$sid} = 1;
-        } elsif ($state eq 'migrate' || $state eq 'relocate') {
+        } else {
             $haenv->log('info', "deferring disarm - service '$sid' is in '$state' state");
-            $deferred_sids->{$sid} = 1;
         }
     }
 
diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
index 26629fb5..37b27e11 100644
--- a/src/PVE/HA/Tools.pm
+++ b/src/PVE/HA/Tools.pm
@@ -213,6 +213,23 @@ sub count_active_services {
     return $active_count;
 }
 
+sub get_disarm_deferred_services {
+    my ($ss, $node) = @_;
+
+    my $deferred_sids = {};
+    my @deferrable_states = qw(fence recovery migrate relocate);
+
+    for my $sid (keys %$ss) {
+        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
+
+        next if $node && (!$current_node || $current_node ne $node);
+
+        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
+    }
+
+    return $deferred_sids;
+}
+
 sub get_verbose_service_state {
     my ($service_state, $service_conf) = @_;
 
diff --git a/src/test/test-disarm-idle-lrm1/log.expect b/src/test/test-disarm-idle-lrm1/log.expect
index 1b7f4ece..d46fbebd 100644
--- a/src/test/test-disarm-idle-lrm1/log.expect
+++ b/src/test/test-disarm-idle-lrm1/log.expect
@@ -26,34 +26,15 @@ info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to n
 info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
 info     22    node2/crm: status change wait_for_quorum => slave
 info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: service vm:103 - start migrate to node 'node2'
+info     25    node3/lrm: service vm:103 - end migrate to node 'node2'
 info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
 info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     40    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node2)
+info     45    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info     45    node3/lrm: status change active => wait_for_agent_lock
+info     60    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info     60    node1/crm: HA stack fully disarmed, releasing CRM watchdog
 info    620     hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm2/log.expect b/src/test/test-disarm-idle-lrm2/log.expect
index d0ba96ff..13e3e2a7 100644
--- a/src/test/test-disarm-idle-lrm2/log.expect
+++ b/src/test/test-disarm-idle-lrm2/log.expect
@@ -23,34 +23,17 @@ info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to n
 info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
 info     22    node2/crm: status change wait_for_quorum => slave
 info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: service vm:103 - start migrate to node 'node2'
+info     25    node3/lrm: service vm:103 - end migrate to node 'node2'
 info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
 info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     40    node1/crm: service 'vm:103': state changed from 'migrate' to 'started'  (node = node2)
+info     45    node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info     45    node3/lrm: status change active => wait_for_agent_lock
+info     60    node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info     60    node1/crm: disarm: freezing service 'vm:103' (was 'started')
+info     60    node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info     60    node1/crm: HA stack fully disarmed, releasing CRM watchdog
 info    620     hardware: exit simulation - done
-- 
2.47.3






* Re: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
  2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
  2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 14:47 ` Daniel Kral
  2026-05-19 16:00   ` Fiona Ebner
  2026-05-19 20:11 ` applied: " Thomas Lamprecht
  3 siblings, 1 reply; 10+ messages in thread
From: Daniel Kral @ 2026-05-19 14:47 UTC (permalink / raw)
  To: Daniel Kral, pve-devel

On Tue May 19, 2026 at 4:38 PM CEST, Daniel Kral wrote:
> As in patch message #2:
>
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
>
> For HA resources, which are still moving, this makes the HA Manager
> stuck in a loop, which tries to defer the disarming process to wait for
> a LRM response for these moving HA resources, which will never come as
> the LRM is idle.
>
> Therefore allow the LRM to become active in disarm mode if there are any
> HA resources on the LRM's node, which are in any of these transient
> states, and make sure that the LRM only processes the disarm-deferring
> HA resources while the LRM is active.

Sorry, forgot to add this trailer to the series (can b4 pick up these
trailers as well?):

Reported-by: Max R. Carrara <m.carrara@proxmox.com>





* Re: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
@ 2026-05-19 16:00   ` Fiona Ebner
  0 siblings, 0 replies; 10+ messages in thread
From: Fiona Ebner @ 2026-05-19 16:00 UTC (permalink / raw)
  To: Daniel Kral, pve-devel

Am 19.05.26 um 4:47 PM schrieb Daniel Kral:
> On Tue May 19, 2026 at 4:38 PM CEST, Daniel Kral wrote:
>> As in patch message #2:
>>
>> If there are HA resources, which are in transient states that defer the
>> disarming process, but their LRMs are already in idle state and disarmed
>> mode, these LRMs will not properly resolve the transient states of these
>> HA resources as assumed by the HA Manager.
>>
>> For HA resources, which are still moving, this makes the HA Manager
>> stuck in a loop, which tries to defer the disarming process to wait for
>> a LRM response for these moving HA resources, which will never come as
>> the LRM is idle.
>>
>> Therefore allow the LRM to become active in disarm mode if there are any
>> HA resources on the LRM's node, which are in any of these transient
>> states, and make sure that the LRM only processes the disarm-deferring
>> HA resources while the LRM is active.
> 
> Sorry, forgot to add a this trailer to the series (can b4 pickup these
> trailers as well?):
> 
> Reported-by: Max R. Carrara <m.carrara@proxmox.com>

Yes, but only with an extra commandline option:

NOTE: some trailers ignored due to from/email mismatches:
    ! Trailer: Reported-by: Max R. Carrara <m.carrara@proxmox.com>
     Msg From: Daniel Kral <d.kral@proxmox.com>
NOTE: Rerun with -S to apply them anyway

and it would order them after your S-o-B.

FWIW, the patches look good to me otherwise, but see my reply to 2/2 for
some nits:

Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>





* Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 16:00   ` Fiona Ebner
  2026-05-20  6:53     ` Daniel Kral
  0 siblings, 1 reply; 10+ messages in thread
From: Fiona Ebner @ 2026-05-19 16:00 UTC (permalink / raw)
  To: Daniel Kral, pve-devel

Am 19.05.26 um 4:39 PM schrieb Daniel Kral:
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
> 
> For HA resources, which are still moving, this makes the HA Manager
> stuck in a loop, which tries to defer the disarming process to wait for
> a LRM response for these moving HA resources, which will never come as
> the LRM is idle.
> 
> Therefore allow the LRM to become active in disarm mode if there are any
> HA resources on the LRM's node, which are in any of these transient
> states, and make sure that the LRM only processes the disarm-deferring
> HA resources while the LRM is active.
> 
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
>  src/PVE/HA/LRM.pm                         | 19 ++++++++++-
>  src/PVE/HA/Manager.pm                     |  8 ++---
>  src/PVE/HA/Tools.pm                       | 17 ++++++++++
>  src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
>  src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
>  5 files changed, 58 insertions(+), 62 deletions(-)
> 
> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
> index 426982cc..9100d611 100644
> --- a/src/PVE/HA/LRM.pm
> +++ b/src/PVE/HA/LRM.pm
> @@ -312,6 +312,18 @@ sub active_service_count {
>      return PVE::HA::Tools::count_active_services($ss, $nodename);
>  }
>  
> +# returns a truthy value if there are HA resources in transient states, which
> +# need to be resolved, e.g. to complete the disarm procedure.
> +sub has_disarm_deferred_services {

Nit: I feel like the variables and functions should rather be named
disarm_deferring rather than disarm_deferred

> +    my ($self) = @_;
> +
> +    my $ss = $self->{service_status};
> +    my $nodename = $self->{haenv}->nodename();
> +    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename);
> +
> +    return %$deferred_sids;
> +}
> +
>  my $wrote_lrm_status_at_startup = 0;
>  
>  sub do_one_iteration {
> @@ -371,7 +383,7 @@ sub work {
>  
>          my $service_count = $self->active_service_count();
>  
> -        if ($self->{mode} eq 'disarm') {
> +        if ($self->{mode} eq 'disarm' && !$self->has_disarm_deferred_services()) {
>              # stay idle while disarmed, don't acquire lock
>          } elsif (!$fence_request && $service_count && $haenv->quorate()) {
>              if ($self->get_protected_ha_agent_lock()) {
> @@ -709,12 +721,17 @@ sub manage_resources {
>      my $nodename = $haenv->nodename();
>  
>      my $ss = $self->{service_status};
> +    my $deferred_sids;

Nit: Here, a full $disarm_deferring_sids would provide the most context.

> +    $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename)
> +        if $self->{mode} eq 'disarm';
>  
>      foreach my $sid (keys %{ $self->{restart_tries} }) {
>          delete $self->{restart_tries}->{$sid} if !$ss->{$sid};
>      }
>  
>      foreach my $sid (keys %$ss) {
> +        next if $deferred_sids && !$deferred_sids->{$sid};
> +
>          my $sd = $ss->{$sid};
>          next if !$sd->{node} || !$sd->{uid};
>          next if $sd->{node} ne $nodename;
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 9b901c4f..a2baf349 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -929,15 +929,13 @@ sub handle_disarm {
>      }
>  
>      # defer disarm if any services are in a transient state that needs the state machine to resolve
> -    my $deferred_sids = {};
> -    for my $sid (sort keys %$ss) {
> +    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss);
> +    for my $sid (sort keys %$deferred_sids) {
>          my $state = $ss->{$sid}->{state};
>          if ($state eq 'fence' || $state eq 'recovery') {
>              $haenv->log('warning', "deferring disarm - service '$sid' is in '$state' state");
> -            $deferred_sids->{$sid} = 1;
> -        } elsif ($state eq 'migrate' || $state eq 'relocate') {
> +        } else {
>              $haenv->log('info', "deferring disarm - service '$sid' is in '$state' state");
> -            $deferred_sids->{$sid} = 1;
>          }
>      }
>  
> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
> index 26629fb5..37b27e11 100644
> --- a/src/PVE/HA/Tools.pm
> +++ b/src/PVE/HA/Tools.pm
> @@ -213,6 +213,23 @@ sub count_active_services {
>      return $active_count;
>  }
>  
> +sub get_disarm_deferred_services {
> +    my ($ss, $node) = @_;
> +
> +    my $deferred_sids = {};
> +    my @deferrable_states = qw(fence recovery migrate relocate);

Nit: disarm_deferring_states

> +
> +    for my $sid (keys %$ss) {
> +        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
> +
> +        next if $node && (!$current_node || $current_node ne $node);

Just wondering: when does !$current_node happen?

> +
> +        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
> +    }
> +
> +    return $deferred_sids;
> +}
> +
>  sub get_verbose_service_state {
>      my ($service_state, $service_conf) = @_;
>  





* applied: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
                   ` (2 preceding siblings ...)
  2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
@ 2026-05-19 20:11 ` Thomas Lamprecht
  2026-05-20  8:07   ` Daniel Kral
  3 siblings, 1 reply; 10+ messages in thread
From: Thomas Lamprecht @ 2026-05-19 20:11 UTC (permalink / raw)
  To: pve-devel, Daniel Kral

On Tue, 19 May 2026 16:38:34 +0200, Daniel Kral wrote:
> As in patch message #2:
> 
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
> 
> [...]

Applied, thanks!

Squashed Fiona's naming nits (disarm_deferring) into 2/2, added a contract doc
on the new Tools helper, and dropped the unused $target_node destructuring and
!$current_node guard while at it.
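
For readers following along, the helper after these fixups could look roughly
like the sketch below. It is not copied from the applied commit: the names
get_disarm_deferring_services and @disarm_deferring_states are assumed from
Fiona's suggestions, the sub is kept standalone instead of living in
PVE::HA::Tools, and the sample service status (in particular the node of
vm:102) is made up for illustration.

```perl
use strict;
use warnings;

# Sketch only: names assumed from the review nits, not the applied code.
# Returns a hash ref { $sid => 1 } of services whose state defers disarming.
# If $node is given, only services currently located on that node count.
sub get_disarm_deferring_services {
    my ($ss, $node) = @_;

    my $disarm_deferring_sids = {};
    my @disarm_deferring_states = qw(fence recovery migrate relocate);

    for my $sid (keys %$ss) {
        my ($state, $current_node) = $ss->{$sid}->@{qw(state node)};

        # no !$current_node guard here, so every entry must carry a node
        next if $node && $current_node ne $node;

        $disarm_deferring_sids->{$sid} = 1
            if grep { $state eq $_ } @disarm_deferring_states;
    }

    return $disarm_deferring_sids;
}

# Hypothetical service status, loosely modelled on the test case log.
my $ss = {
    'vm:102' => { state => 'started', node => 'node3' },
    'vm:103' => { state => 'migrate', node => 'node3', target => 'node2' },
};

my $sids = get_disarm_deferring_services($ss, 'node3');
print join(',', sort keys %$sids), "\n"; # prints "vm:103"
```

Calling the sketch without a node argument considers all nodes, which matches
how the CRM side uses the helper in the patch.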

Since I also looked into this a bit, seemingly from a different angle, I
added a follow-up that also freezes non-deferred services during the deferred
cycle. Otherwise, any LRM restart in the meantime (apt upgrade, manual restart)
hangs in 'restart' mode, waiting for active_service_count to reach zero, which
won't happen while the CRM is in deferred mode.

Bumped to 5.2.4.

[1/2] test: add disarm test cases for idle lrms with transient ha resources
      commit: 54497ae43d5bdb3842eb6f32e0642da9deb7cf5b
[2/2] make idle LRMs resolve leftover moving HA resources while disarmed
      commit: 9917559d01e60cc55622102bef2b8f5d87892d33





* Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 16:00   ` Fiona Ebner
@ 2026-05-20  6:53     ` Daniel Kral
  2026-05-20  7:48       ` Daniel Kral
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Kral @ 2026-05-20  6:53 UTC (permalink / raw)
  To: Fiona Ebner, pve-devel

On Tue May 19, 2026 at 6:00 PM CEST, Fiona Ebner wrote:
> Am 19.05.26 um 4:39 PM schrieb Daniel Kral:
>> If there are HA resources, which are in transient states that defer the
>> disarming process, but their LRMs are already in idle state and disarmed
>> mode, these LRMs will not properly resolve the transient states of these
>> HA resources as assumed by the HA Manager.
>> 
>> For HA resources, which are still moving, this makes the HA Manager
>> stuck in a loop, which tries to defer the disarming process to wait for
>> a LRM response for these moving HA resources, which will never come as
>> the LRM is idle.
>> 
>> Therefore allow the LRM to become active in disarm mode if there are any
>> HA resources on the LRM's node, which are in any of these transient
>> states, and make sure that the LRM only processes the disarm-deferring
>> HA resources while the LRM is active.
>> 
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>>  src/PVE/HA/LRM.pm                         | 19 ++++++++++-
>>  src/PVE/HA/Manager.pm                     |  8 ++---
>>  src/PVE/HA/Tools.pm                       | 17 ++++++++++
>>  src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
>>  src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
>>  5 files changed, 58 insertions(+), 62 deletions(-)
>> 
>> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
>> index 426982cc..9100d611 100644
>> --- a/src/PVE/HA/LRM.pm
>> +++ b/src/PVE/HA/LRM.pm
>> @@ -312,6 +312,18 @@ sub active_service_count {
>>      return PVE::HA::Tools::count_active_services($ss, $nodename);
>>  }
>>  
>> +# returns a truthy value if there are HA resources in transient states, which
>> +# need to be resolved, e.g. to complete the disarm procedure.
>> +sub has_disarm_deferred_services {
>
> Nit: I feel like the variables and functions should rather be named
> disarm_deferring rather than disarm_deferred

Thanks for the review, agree with all the renames!

[snip]

>> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
>> index 26629fb5..37b27e11 100644
>> --- a/src/PVE/HA/Tools.pm
>> +++ b/src/PVE/HA/Tools.pm
>> @@ -213,6 +213,23 @@ sub count_active_services {
>>      return $active_count;
>>  }
>>  
>> +sub get_disarm_deferred_services {
>> +    my ($ss, $node) = @_;
>> +
>> +    my $deferred_sids = {};
>> +    my @deferrable_states = qw(fence recovery migrate relocate);
>
> Nit: disarm_deferring_states
>
>> +
>> +    for my $sid (keys %$ss) {
>> +        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
>> +
>> +        next if $node && (!$current_node || $current_node ne $node);
>
> Just wondering: when does !$current_node happen?

AFAIK the only case where this can currently happen is if the HA
resource's guest doesn't exist in the cluster anymore according to the
pmxcfs' vmlist and isn't removed by HA Manager anymore (as is done when
the HA stack is in disarm mode).

>
>> +
>> +        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
>> +    }
>> +
>> +    return $deferred_sids;
>> +}
>> +
>>  sub get_verbose_service_state {
>>      my ($service_state, $service_conf) = @_;
>>  






* Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-20  6:53     ` Daniel Kral
@ 2026-05-20  7:48       ` Daniel Kral
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Kral @ 2026-05-20  7:48 UTC (permalink / raw)
  To: Daniel Kral, Fiona Ebner, Thomas Lamprecht, pve-devel

On Wed May 20, 2026 at 8:53 AM CEST, Daniel Kral wrote:
> On Tue May 19, 2026 at 6:00 PM CEST, Fiona Ebner wrote:
>> Am 19.05.26 um 4:39 PM schrieb Daniel Kral:
>>> +
>>> +    for my $sid (keys %$ss) {
>>> +        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
>>> +
>>> +        next if $node && (!$current_node || $current_node ne $node);
>>
>> Just wondering: when does !$current_node happen?
>
> AFAIK the only case where this can currently happen is if the HA
> resource's guest doesn't exist in the cluster anymore according to the
> pmxcfs' vmlist and isn't removed by HA Manager anymore (as is done when
> the HA stack is in disarm mode).

Sorry for the noise, had another look: the HA Manager never removes HA
resources that have an undef node (e.g. if the VM was removed in some
way that bypasses the check to also prune the HA resource from the
config) no matter if the HA stack is disarming or not:

    # jq '.service_status["vm:2000"]' /etc/pve/ha/manager_status
    {
      "node": "pve",
      "uid": "pHQkcW2HF1jeyQJ5JLb/8Q",
      "state": "stopped"
    }
    # mv /etc/pve/nodes/pve/qemu-server/2000.conf .
    # jq '.service_status["vm:2000"]' /etc/pve/ha/manager_status
    {
      "state": "stopped",
      "uid": "wtRkyVgpB7LcmCtqGBtf+w",
      "node": null
    }

Trying this out a few times showed that this is also one way undef node names
get written to the manager_status. As there was never a timestamp for the
undef node entry, the HA Manager tried to fence the VM, which broke quite a
few assumptions in the HA Manager:

May 20 09:24:51 pve-2 pve-ha-crm[22795]: unable to score nodes according to dynamic usage for service 'vm:2000' - did not get dynamic service usage information for 'vm:2000'
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $fenced_node in concatenation (.) or string at /usr/share/perl5/PVE/HA/Manager.pm line 1663.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $fenced_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 1664.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: recover service 'vm:2000' from fenced node '' to node 'pve'
May 20 09:24:51 pve-2 pve-ha-crm[22795]: got unexpected error - Configuration file 'nodes/pve-2/qemu-server/2000.conf' does not exist

This isn't something that should happen in normal circumstances though,
I'll send a patch doing more checks to be defensive and/or removing the
service status entry if the HA resource's guest isn't in the vmlist
anymore, though for the latter I'll have to check if that could cause
any trouble.
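
A minimal sketch of what such a defensive check could look like, assuming a
hypothetical helper service_is_on_node (not an existing PVE::HA function) and
made-up service entries. It only illustrates that checking for an undef node
first avoids the "uninitialized value" warnings shown above:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of PVE::HA::Tools.
# Returns 1 if the service entry is located on $node. An undef node
# (e.g. the guest vanished from the vmlist) counts as "on no node".
sub service_is_on_node {
    my ($sd, $node) = @_;

    my $current_node = $sd->{node};
    return 0 if !defined($current_node);

    return $current_node eq $node ? 1 : 0;
}

# Made-up service status; vm:2000 mirrors the entry with "node": null.
my $ss = {
    'vm:2000' => { state => 'stopped', node => undef },
    'vm:103' => { state => 'migrate', node => 'pve' },
};

# Collect warnings to show that the undef node does not trigger any.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my @on_pve = grep { service_is_on_node($ss->{$_}, 'pve') } sort keys %$ss;
print "on pve: @on_pve, warnings: ", scalar(@warnings), "\n";
# prints "on pve: vm:103, warnings: 0"
```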

Furthermore, if the HA resource then gets fenced, the HA Manager will
acquire the lock for its own node as get_ha_agent_lock($self, $node)
defaults to the current nodename if $node is undef.

It might also be worth detecting in the HA Manager that HA resources have
changed nodes in between, as it currently doesn't update the node at all if
the resource is moved manually and is already present in the HA Manager
status... I'll look into it further, as I already wanted to change this
behavior slightly for a partial fix of e.g. fenced HA resources, which
were migrated in the meantime [0].

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=6610

>
>>
>>> +
>>> +        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
>>> +    }
>>> +
>>> +    return $deferred_sids;
>>> +}
>>> +
>>>  sub get_verbose_service_state {
>>>      my ($service_state, $service_conf) = @_;
>>>  






* Re: applied: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
  2026-05-19 20:11 ` applied: " Thomas Lamprecht
@ 2026-05-20  8:07   ` Daniel Kral
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Kral @ 2026-05-20  8:07 UTC (permalink / raw)
  To: Thomas Lamprecht, pve-devel

On Tue May 19, 2026 at 10:11 PM CEST, Thomas Lamprecht wrote:
> On Tue, 19 May 2026 16:38:34 +0200, Daniel Kral wrote:
>> As in patch message #2:
>> 
>> If there are HA resources, which are in transient states that defer the
>> disarming process, but their LRMs are already in idle state and disarmed
>> mode, these LRMs will not properly resolve the transient states of these
>> HA resources as assumed by the HA Manager.
>> 
>> [...]
>
> Applied, thanks!
>
> Squashed Fiona's naming nits (disarm_deferring) into 2/2, added a contract doc
> on the new Tools helper, and dropped the unused $target_node destructuring and
> !$current_node guard while at it.

Thanks!

I started out also making disarmed LRMs active if an HA resource's
$target_node eq $node, but this wasn't required as the successful
migration is already confirmed by the $sd->{node} entry alone, which is
gathered from the pmxcfs itself, so no need to make the receiving LRM
active for that.
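
To illustrate with a rough standalone sketch (the sub name
lrm_stays_idle_while_disarmed and the %DISARM_DEFERRING table are made up for
this mail, the real check lives in the LRM's work() method): only the LRM on
the node that currently holds the moving HA resource needs to leave idle, the
LRM on the target node can stay idle.

```perl
use strict;
use warnings;

my %DISARM_DEFERRING = map { $_ => 1 } qw(fence recovery migrate relocate);

# Hypothetical standalone version of the disarm check. Returns 1 if an LRM
# in disarm mode can stay idle, i.e. no service currently located on its
# node is in a disarm-deferring state. Returns 0 outside of disarm mode,
# where the regular conditions (quorum, service count, ...) apply instead.
sub lrm_stays_idle_while_disarmed {
    my ($mode, $ss, $nodename) = @_;

    return 0 if $mode ne 'disarm';

    for my $sd (values %$ss) {
        next if !defined($sd->{node}) || $sd->{node} ne $nodename;
        return 0 if $DISARM_DEFERRING{ $sd->{state} };
    }

    return 1;
}

# vm:103 is still migrating from node3 to node2, as in the test case.
my $ss = {
    'vm:103' => { state => 'migrate', node => 'node3', target => 'node2' },
};

for my $nodename (qw(node2 node3)) {
    my $idle = lrm_stays_idle_while_disarmed('disarm', $ss, $nodename);
    printf "%s: %s\n", $nodename, $idle ? 'idle' : 'active';
}
# prints "node2: idle" and "node3: active"
```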

The !$current_node guard might still be worth keeping (see my reply to
Fiona's review on patch 2/2) to be defensive against some edge cases.

>
> As I looked into this too a bit, and seemingly went for a different angle, I
> added a follow-up to also freeze non-deferred services on the deferred cycle,
> otherwise any LRM restart in the meantime (apt upgrade, manual restart) hangs
> in 'restart' mode waiting for an active_service_count going to zero that due to
> the deferred-mode CRM won't happen.

Thanks for the follow-ups and good catch that the frozen HA resources
are also accounted for in the {basic,static} load accounting, good to have
test cases for that! I'll also take more care to test with the other
shutdown modes in production!

>
> Bumped to 5.2.4.
>
> [1/2] test: add disarm test cases for idle lrms with transient ha resources
>       commit: 54497ae43d5bdb3842eb6f32e0642da9deb7cf5b
> [2/2] make idle LRMs resolve leftover moving HA resources while disarmed
>       commit: 9917559d01e60cc55622102bef2b8f5d87892d33






end of thread, other threads:[~2026-05-20  8:08 UTC | newest]
