* [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
@ 2026-05-19 14:38 Daniel Kral
2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
To: pve-devel
As in patch message #2:
If there are HA resources in transient states that defer the disarming
process, but their LRMs are already idle and in disarm mode, these LRMs
will not resolve the transient states of these HA resources as the HA
Manager assumes.
For HA resources that are still moving, this leaves the HA Manager stuck
in a loop: it keeps deferring the disarming process to wait for an LRM
response for these moving HA resources, which never comes because the
LRM is idle.
Therefore, allow the LRM to become active in disarm mode if any HA
resources on the LRM's node are in one of these transient states, and
make sure that the LRM only processes the disarm-deferring HA resources
while it is active.
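To illustrate the selection described above, here is a minimal,
self-contained Perl sketch. It is not the patched code: the names
deferred_services and %DEFERRING are illustrative assumptions, the sample
data is modelled on the manager_status of the new test cases, and the real
LRM additionally checks fence requests and quorum before it becomes active.

```perl
use strict;
use warnings;

# Assumption for illustration: the disarm-deferring states named in this
# series (fence, recovery, migrate, relocate).
my %DEFERRING = map { $_ => 1 } qw(fence recovery migrate relocate);

# Return a hash reference of service IDs that are in a disarm-deferring
# state, optionally restricted to services currently located on $node.
sub deferred_services {
    my ($ss, $node) = @_;

    my $deferred = {};
    for my $sid (keys %$ss) {
        my $sd = $ss->{$sid};

        # With a node filter, skip services located on another node.
        next if defined($node) && (!defined($sd->{node}) || $sd->{node} ne $node);

        $deferred->{$sid} = 1 if $DEFERRING{ $sd->{state} // '' };
    }

    return $deferred;
}

# Sample service status, modelled on the test cases in patch 1/2.
my $ss = {
    'vm:101' => { node => 'node1', state => 'started' },
    'vm:102' => { node => 'node2', state => 'fence' },
    'vm:103' => { node => 'node3', state => 'migrate', target => 'node2' },
};

# Per-node view: an LRM in disarm mode would only have a reason to leave
# the idle state if this count is non-zero for its own node.
for my $node (qw(node1 node2 node3)) {
    my $count = scalar(keys %{ deferred_services($ss, $node) });
    print "$node: $count disarm-deferring service(s)\n";
}
```

With this data, only node2 and node3 report a disarm-deferring service.
Calling the function without a node yields the cluster-wide set, which
corresponds to what the HA Manager needs when deciding whether to defer
the disarm.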
Daniel Kral (2):
test: add disarm test cases for idle lrms with transient ha resources
make idle LRMs resolve leftover moving HA resources while disarmed
src/PVE/HA/LRM.pm | 19 ++++++++-
src/PVE/HA/Manager.pm | 8 ++--
src/PVE/HA/Tools.pm | 17 ++++++++
src/test/test-disarm-idle-lrm1/README | 8 ++++
src/test/test-disarm-idle-lrm1/cmdlist | 3 ++
.../test-disarm-idle-lrm1/hardware_status | 5 +++
src/test/test-disarm-idle-lrm1/log.expect | 40 +++++++++++++++++++
src/test/test-disarm-idle-lrm1/manager_status | 26 ++++++++++++
src/test/test-disarm-idle-lrm1/service_config | 5 +++
src/test/test-disarm-idle-lrm2/README | 8 ++++
src/test/test-disarm-idle-lrm2/cmdlist | 3 ++
.../test-disarm-idle-lrm2/hardware_status | 5 +++
src/test/test-disarm-idle-lrm2/log.expect | 39 ++++++++++++++++++
src/test/test-disarm-idle-lrm2/manager_status | 26 ++++++++++++
src/test/test-disarm-idle-lrm2/service_config | 5 +++
15 files changed, 211 insertions(+), 6 deletions(-)
create mode 100644 src/test/test-disarm-idle-lrm1/README
create mode 100644 src/test/test-disarm-idle-lrm1/cmdlist
create mode 100644 src/test/test-disarm-idle-lrm1/hardware_status
create mode 100644 src/test/test-disarm-idle-lrm1/log.expect
create mode 100644 src/test/test-disarm-idle-lrm1/manager_status
create mode 100644 src/test/test-disarm-idle-lrm1/service_config
create mode 100644 src/test/test-disarm-idle-lrm2/README
create mode 100644 src/test/test-disarm-idle-lrm2/cmdlist
create mode 100644 src/test/test-disarm-idle-lrm2/hardware_status
create mode 100644 src/test/test-disarm-idle-lrm2/log.expect
create mode 100644 src/test/test-disarm-idle-lrm2/manager_status
create mode 100644 src/test/test-disarm-idle-lrm2/service_config
--
2.47.3
* [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources
2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 14:38 ` Daniel Kral
2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
To: pve-devel
These test cases document how the HA stack currently behaves if it is
disarmed while there are HA resources in disarm-deferring states (fence,
recovery, migrate, relocate) and their responsible LRMs are already
idle, which makes them unresponsive to the HA Manager while they are in
disarm mode.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/test/test-disarm-idle-lrm1/README | 8 +++
src/test/test-disarm-idle-lrm1/cmdlist | 3 +
.../test-disarm-idle-lrm1/hardware_status | 5 ++
src/test/test-disarm-idle-lrm1/log.expect | 59 +++++++++++++++++++
src/test/test-disarm-idle-lrm1/manager_status | 26 ++++++++
src/test/test-disarm-idle-lrm1/service_config | 5 ++
src/test/test-disarm-idle-lrm2/README | 8 +++
src/test/test-disarm-idle-lrm2/cmdlist | 3 +
.../test-disarm-idle-lrm2/hardware_status | 5 ++
src/test/test-disarm-idle-lrm2/log.expect | 56 ++++++++++++++++++
src/test/test-disarm-idle-lrm2/manager_status | 26 ++++++++
src/test/test-disarm-idle-lrm2/service_config | 5 ++
12 files changed, 209 insertions(+)
create mode 100644 src/test/test-disarm-idle-lrm1/README
create mode 100644 src/test/test-disarm-idle-lrm1/cmdlist
create mode 100644 src/test/test-disarm-idle-lrm1/hardware_status
create mode 100644 src/test/test-disarm-idle-lrm1/log.expect
create mode 100644 src/test/test-disarm-idle-lrm1/manager_status
create mode 100644 src/test/test-disarm-idle-lrm1/service_config
create mode 100644 src/test/test-disarm-idle-lrm2/README
create mode 100644 src/test/test-disarm-idle-lrm2/cmdlist
create mode 100644 src/test/test-disarm-idle-lrm2/hardware_status
create mode 100644 src/test/test-disarm-idle-lrm2/log.expect
create mode 100644 src/test/test-disarm-idle-lrm2/manager_status
create mode 100644 src/test/test-disarm-idle-lrm2/service_config
diff --git a/src/test/test-disarm-idle-lrm1/README b/src/test/test-disarm-idle-lrm1/README
new file mode 100644
index 00000000..6d5124cd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'ignore' mode (keep HA resources in their previous
+state) while there is an HA resource in a transient state on a node, whose LRM
+is idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. Though as the LRM for the moving HA resource is idle, the HA
+Manager doesn't get any LRM response for the HA resource, for which the
+disarming is deferred, and therefore the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm1/cmdlist b/src/test/test-disarm-idle-lrm1/cmdlist
new file mode 100644
index 00000000..a29cf8e3
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/cmdlist
@@ -0,0 +1,3 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha ignore" ]
+]
diff --git a/src/test/test-disarm-idle-lrm1/hardware_status b/src/test/test-disarm-idle-lrm1/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/hardware_status
@@ -0,0 +1,5 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm1/log.expect b/src/test/test-disarm-idle-lrm1/log.expect
new file mode 100644
index 00000000..1b7f4ece
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/log.expect
@@ -0,0 +1,59 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute crm node1 disarm-ha ignore
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: got crm command: disarm-ha ignore
+info 20 node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info 20 node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info 20 node1/crm: disarm: suspending HA tracking for service 'vm:103'
+warn 20 node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info 20 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 20 node1/crm: got lock 'ha_agent_node2_lock'
+info 20 node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info 20 node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai 20 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info 20 node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info 20 node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info 20 node1/crm: service 'vm:102': state changed from 'recovery' to 'started' (node = node1)
+info 22 node2/crm: status change wait_for_quorum => slave
+info 24 node3/crm: status change wait_for_quorum => slave
+info 40 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 40 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 60 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 80 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 100 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 120 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 140 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 160 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 180 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 200 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 220 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 240 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 260 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 280 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 300 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 320 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 340 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 360 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 380 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 400 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 420 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 440 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 460 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 480 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 500 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 520 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 540 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 560 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 580 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 600 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 620 hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm1/manager_status b/src/test/test-disarm-idle-lrm1/manager_status
new file mode 100644
index 00000000..ba7448a0
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/manager_status
@@ -0,0 +1,26 @@
+{
+ "master_node": "node1",
+ "node_status": {
+ "node1":"online",
+ "node2":"fence",
+ "node3":"online"
+ },
+ "service_status": {
+ "vm:101": {
+ "node": "node1",
+ "state": "started",
+ "uid": "lavE3c7vLnotUBGT9whswg"
+ },
+ "vm:102": {
+ "node": "node2",
+ "state": "fence",
+ "uid": "lavE3c7vLnotUBGT9whswh"
+ },
+ "vm:103": {
+ "node": "node3",
+ "state": "migrate",
+ "target": "node2",
+ "uid": "lavE3c7vLnotUBGT9whswj"
+ }
+ }
+}
diff --git a/src/test/test-disarm-idle-lrm1/service_config b/src/test/test-disarm-idle-lrm1/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/service_config
@@ -0,0 +1,5 @@
+{
+ "vm:101": { "node": "node1", "state": "started" },
+ "vm:102": { "node": "node2", "state": "started" },
+ "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/README b/src/test/test-disarm-idle-lrm2/README
new file mode 100644
index 00000000..d1731578
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'freeze' mode (keep HA resources frozen while disarmed)
+while there is an HA resource in a transient state on a node, whose LRM is
+idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. Though as the LRM for the moving HA resource is idle, the HA
+Manager doesn't get any LRM response for the HA resource, for which the
+disarming is deferred, and therefore the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm2/cmdlist b/src/test/test-disarm-idle-lrm2/cmdlist
new file mode 100644
index 00000000..5a46a662
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/cmdlist
@@ -0,0 +1,3 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha freeze" ]
+]
diff --git a/src/test/test-disarm-idle-lrm2/hardware_status b/src/test/test-disarm-idle-lrm2/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/hardware_status
@@ -0,0 +1,5 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/log.expect b/src/test/test-disarm-idle-lrm2/log.expect
new file mode 100644
index 00000000..d0ba96ff
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/log.expect
@@ -0,0 +1,56 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute crm node1 disarm-ha freeze
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: got crm command: disarm-ha freeze
+warn 20 node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info 20 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 20 node1/crm: got lock 'ha_agent_node2_lock'
+info 20 node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info 20 node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai 20 node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info 20 node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info 20 node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info 20 node1/crm: service 'vm:102': state changed from 'recovery' to 'started' (node = node1)
+info 22 node2/crm: status change wait_for_quorum => slave
+info 24 node3/crm: status change wait_for_quorum => slave
+info 40 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 40 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 60 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 80 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 100 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 120 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 140 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 160 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 180 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 200 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 220 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 240 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 260 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 280 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 300 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 320 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 340 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 360 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 380 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 400 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 420 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 440 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 460 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 480 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 500 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 520 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 540 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 560 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 580 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 600 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 620 hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm2/manager_status b/src/test/test-disarm-idle-lrm2/manager_status
new file mode 100644
index 00000000..c28f4ffd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/manager_status
@@ -0,0 +1,26 @@
+{
+ "master_node": "node1",
+ "node_status": {
+ "node1":"online",
+ "node2":"fence",
+ "node3":"online"
+ },
+ "service_status": {
+ "vm:101": {
+ "node": "node1",
+ "state": "online",
+ "uid": "lavE3c7vLnotUBGT9whswg"
+ },
+ "vm:102": {
+ "node": "node2",
+ "state": "fence",
+ "uid": "lavE3c7vLnotUBGT9whswh"
+ },
+ "vm:103": {
+ "node": "node3",
+ "state": "migrate",
+ "target": "node2",
+ "uid": "lavE3c7vLnotUBGT9whswj"
+ }
+ }
+}
diff --git a/src/test/test-disarm-idle-lrm2/service_config b/src/test/test-disarm-idle-lrm2/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/service_config
@@ -0,0 +1,5 @@
+{
+ "vm:101": { "node": "node1", "state": "started" },
+ "vm:102": { "node": "node2", "state": "started" },
+ "vm:103": { "node": "node3", "state": "started" }
+}
--
2.47.3
* [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
@ 2026-05-19 14:38 ` Daniel Kral
2026-05-19 16:00 ` Fiona Ebner
2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
2026-05-19 20:11 ` applied: " Thomas Lamprecht
3 siblings, 1 reply; 7+ messages in thread
From: Daniel Kral @ 2026-05-19 14:38 UTC
To: pve-devel
If there are HA resources in transient states that defer the disarming
process, but their LRMs are already idle and in disarm mode, these LRMs
will not resolve the transient states of these HA resources as the HA
Manager assumes.
For HA resources that are still moving, this leaves the HA Manager stuck
in a loop: it keeps deferring the disarming process to wait for an LRM
response for these moving HA resources, which never comes because the
LRM is idle.
Therefore, allow the LRM to become active in disarm mode if any HA
resources on the LRM's node are in one of these transient states, and
make sure that the LRM only processes the disarm-deferring HA resources
while it is active.
Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
src/PVE/HA/LRM.pm | 19 ++++++++++-
src/PVE/HA/Manager.pm | 8 ++---
src/PVE/HA/Tools.pm | 17 ++++++++++
src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
5 files changed, 58 insertions(+), 62 deletions(-)
diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 426982cc..9100d611 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -312,6 +312,18 @@ sub active_service_count {
return PVE::HA::Tools::count_active_services($ss, $nodename);
}
+# returns a truthy value if there are HA resources in transient states, which
+# need to be resolved, e.g. to complete the disarm procedure.
+sub has_disarm_deferred_services {
+ my ($self) = @_;
+
+ my $ss = $self->{service_status};
+ my $nodename = $self->{haenv}->nodename();
+ my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename);
+
+ return %$deferred_sids;
+}
+
my $wrote_lrm_status_at_startup = 0;
sub do_one_iteration {
@@ -371,7 +383,7 @@ sub work {
my $service_count = $self->active_service_count();
- if ($self->{mode} eq 'disarm') {
+ if ($self->{mode} eq 'disarm' && !$self->has_disarm_deferred_services()) {
# stay idle while disarmed, don't acquire lock
} elsif (!$fence_request && $service_count && $haenv->quorate()) {
if ($self->get_protected_ha_agent_lock()) {
@@ -709,12 +721,17 @@ sub manage_resources {
my $nodename = $haenv->nodename();
my $ss = $self->{service_status};
+ my $deferred_sids;
+ $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename)
+ if $self->{mode} eq 'disarm';
foreach my $sid (keys %{ $self->{restart_tries} }) {
delete $self->{restart_tries}->{$sid} if !$ss->{$sid};
}
foreach my $sid (keys %$ss) {
+ next if $deferred_sids && !$deferred_sids->{$sid};
+
my $sd = $ss->{$sid};
next if !$sd->{node} || !$sd->{uid};
next if $sd->{node} ne $nodename;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 9b901c4f..a2baf349 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -929,15 +929,13 @@ sub handle_disarm {
}
# defer disarm if any services are in a transient state that needs the state machine to resolve
- my $deferred_sids = {};
- for my $sid (sort keys %$ss) {
+ my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss);
+ for my $sid (sort keys %$deferred_sids) {
my $state = $ss->{$sid}->{state};
if ($state eq 'fence' || $state eq 'recovery') {
$haenv->log('warning', "deferring disarm - service '$sid' is in '$state' state");
- $deferred_sids->{$sid} = 1;
- } elsif ($state eq 'migrate' || $state eq 'relocate') {
+ } else {
$haenv->log('info', "deferring disarm - service '$sid' is in '$state' state");
- $deferred_sids->{$sid} = 1;
}
}
diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
index 26629fb5..37b27e11 100644
--- a/src/PVE/HA/Tools.pm
+++ b/src/PVE/HA/Tools.pm
@@ -213,6 +213,23 @@ sub count_active_services {
return $active_count;
}
+sub get_disarm_deferred_services {
+ my ($ss, $node) = @_;
+
+ my $deferred_sids = {};
+ my @deferrable_states = qw(fence recovery migrate relocate);
+
+ for my $sid (keys %$ss) {
+ my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
+
+ next if $node && (!$current_node || $current_node ne $node);
+
+ $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
+ }
+
+ return $deferred_sids;
+}
+
sub get_verbose_service_state {
my ($service_state, $service_conf) = @_;
diff --git a/src/test/test-disarm-idle-lrm1/log.expect b/src/test/test-disarm-idle-lrm1/log.expect
index 1b7f4ece..d46fbebd 100644
--- a/src/test/test-disarm-idle-lrm1/log.expect
+++ b/src/test/test-disarm-idle-lrm1/log.expect
@@ -26,34 +26,15 @@ info 20 node1/crm: recover service 'vm:102' from fenced node 'node2' to n
info 20 node1/crm: service 'vm:102': state changed from 'recovery' to 'started' (node = node1)
info 22 node2/crm: status change wait_for_quorum => slave
info 24 node3/crm: status change wait_for_quorum => slave
+info 25 node3/lrm: got lock 'ha_agent_node3_lock'
+info 25 node3/lrm: status change wait_for_agent_lock => active
+info 25 node3/lrm: service vm:103 - start migrate to node 'node2'
+info 25 node3/lrm: service vm:103 - end migrate to node 'node2'
info 40 node1/crm: node 'node2': state changed from 'unknown' => 'online'
info 40 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 60 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 80 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 100 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 120 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 140 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 160 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 180 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 200 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 220 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 240 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 260 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 280 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 300 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 320 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 340 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 360 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 380 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 400 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 420 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 440 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 460 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 480 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 500 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 520 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 540 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 560 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 580 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 600 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 40 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node2)
+info 45 node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info 45 node3/lrm: status change active => wait_for_agent_lock
+info 60 node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info 60 node1/crm: HA stack fully disarmed, releasing CRM watchdog
info 620 hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm2/log.expect b/src/test/test-disarm-idle-lrm2/log.expect
index d0ba96ff..13e3e2a7 100644
--- a/src/test/test-disarm-idle-lrm2/log.expect
+++ b/src/test/test-disarm-idle-lrm2/log.expect
@@ -23,34 +23,17 @@ info 20 node1/crm: recover service 'vm:102' from fenced node 'node2' to n
info 20 node1/crm: service 'vm:102': state changed from 'recovery' to 'started' (node = node1)
info 22 node2/crm: status change wait_for_quorum => slave
info 24 node3/crm: status change wait_for_quorum => slave
+info 25 node3/lrm: got lock 'ha_agent_node3_lock'
+info 25 node3/lrm: status change wait_for_agent_lock => active
+info 25 node3/lrm: service vm:103 - start migrate to node 'node2'
+info 25 node3/lrm: service vm:103 - end migrate to node 'node2'
info 40 node1/crm: node 'node2': state changed from 'unknown' => 'online'
info 40 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 60 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 80 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 100 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 120 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 140 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 160 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 180 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 200 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 220 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 240 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 260 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 280 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 300 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 320 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 340 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 360 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 380 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 400 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 420 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 440 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 460 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 480 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 500 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 520 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 540 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 560 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 580 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
-info 600 node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info 40 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node2)
+info 45 node3/lrm: HA disarm requested, releasing agent lock and watchdog
+info 45 node3/lrm: status change active => wait_for_agent_lock
+info 60 node1/crm: disarm: freezing service 'vm:102' (was 'started')
+info 60 node1/crm: disarm: freezing service 'vm:103' (was 'started')
+info 60 node1/crm: all LRMs disarmed, HA stack is now fully disarmed
+info 60 node1/crm: HA stack fully disarmed, releasing CRM watchdog
info 620 hardware: exit simulation - done
--
2.47.3
* Re: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
2026-05-19 14:38 ` [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources Daniel Kral
2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 14:47 ` Daniel Kral
2026-05-19 16:00 ` Fiona Ebner
2026-05-19 20:11 ` applied: " Thomas Lamprecht
3 siblings, 1 reply; 7+ messages in thread
From: Daniel Kral @ 2026-05-19 14:47 UTC
To: Daniel Kral, pve-devel
On Tue May 19, 2026 at 4:38 PM CEST, Daniel Kral wrote:
> As in patch message #2:
>
> If there are HA resources in transient states that defer the disarming
> process, but their LRMs are already idle and in disarm mode, these LRMs
> will not resolve the transient states of these HA resources as the HA
> Manager assumes.
>
> For HA resources that are still moving, this leaves the HA Manager stuck
> in a loop: it keeps deferring the disarming process to wait for an LRM
> response for these moving HA resources, which never comes because the
> LRM is idle.
>
> Therefore, allow the LRM to become active in disarm mode if any HA
> resources on the LRM's node are in one of these transient states, and
> make sure that the LRM only processes the disarm-deferring HA resources
> while it is active.
Sorry, I forgot to add this trailer to the series (can b4 pick up these
trailers as well?):
Reported-by: Max R. Carrara <m.carrara@proxmox.com>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
@ 2026-05-19 16:00 ` Fiona Ebner
0 siblings, 0 replies; 7+ messages in thread
From: Fiona Ebner @ 2026-05-19 16:00 UTC (permalink / raw)
To: Daniel Kral, pve-devel
Am 19.05.26 um 4:47 PM schrieb Daniel Kral:
> On Tue May 19, 2026 at 4:38 PM CEST, Daniel Kral wrote:
>> As in patch message #2:
>>
>> If there are HA resources, which are in transient states that defer the
>> disarming process, but their LRMs are already in idle state and disarmed
>> mode, these LRMs will not properly resolve the transient states of these
>> HA resources as assumed by the HA Manager.
>>
>> For HA resources, which are still moving, this makes the HA Manager
>> stuck in a loop, which tries to defer the disarming process to wait for
>> a LRM response for these moving HA resources, which will never come as
>> the LRM is idle.
>>
>> Therefore allow the LRM to become active in disarm mode if there are any
>> HA resources on the LRM's node, which are in any of these transient
>> states, and make sure that the LRM only processes the disarm-deferring
>> HA resources while the LRM is active.
>
> Sorry, I forgot to add this trailer to the series (can b4 pick up these
> trailers as well?):
>
> Reported-by: Max R. Carrara <m.carrara@proxmox.com>
Yes, but only with an extra commandline option:
NOTE: some trailers ignored due to from/email mismatches:
! Trailer: Reported-by: Max R. Carrara <m.carrara@proxmox.com>
Msg From: Daniel Kral <d.kral@proxmox.com>
NOTE: Rerun with -S to apply them anyway
and it would order them after your S-o-B.
FWIW, the patches look good to me otherwise, but see my reply to 2/2 for
some nits:
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
2026-05-19 14:38 ` [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
@ 2026-05-19 16:00 ` Fiona Ebner
0 siblings, 0 replies; 7+ messages in thread
From: Fiona Ebner @ 2026-05-19 16:00 UTC (permalink / raw)
To: Daniel Kral, pve-devel
Am 19.05.26 um 4:39 PM schrieb Daniel Kral:
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
>
> For HA resources, which are still moving, this makes the HA Manager
> stuck in a loop, which tries to defer the disarming process to wait for
> a LRM response for these moving HA resources, which will never come as
> the LRM is idle.
>
> Therefore allow the LRM to become active in disarm mode if there are any
> HA resources on the LRM's node, which are in any of these transient
> states, and make sure that the LRM only processes the disarm-deferring
> HA resources while the LRM is active.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> src/PVE/HA/LRM.pm | 19 ++++++++++-
> src/PVE/HA/Manager.pm | 8 ++---
> src/PVE/HA/Tools.pm | 17 ++++++++++
> src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
> src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
> 5 files changed, 58 insertions(+), 62 deletions(-)
>
> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
> index 426982cc..9100d611 100644
> --- a/src/PVE/HA/LRM.pm
> +++ b/src/PVE/HA/LRM.pm
> @@ -312,6 +312,18 @@ sub active_service_count {
> return PVE::HA::Tools::count_active_services($ss, $nodename);
> }
>
> +# returns a truthy value if there are HA resources in transient states, which
> +# need to be resolved, e.g. to complete the disarm procedure.
> +sub has_disarm_deferred_services {
Nit: I feel like the variables and functions should be named
disarm_deferring rather than disarm_deferred
> + my ($self) = @_;
> +
> + my $ss = $self->{service_status};
> + my $nodename = $self->{haenv}->nodename();
> + my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename);
> +
> + return %$deferred_sids;
> +}
> +
> my $wrote_lrm_status_at_startup = 0;
>
> sub do_one_iteration {
> @@ -371,7 +383,7 @@ sub work {
>
> my $service_count = $self->active_service_count();
>
> - if ($self->{mode} eq 'disarm') {
> + if ($self->{mode} eq 'disarm' && !$self->has_disarm_deferred_services()) {
> # stay idle while disarmed, don't acquire lock
> } elsif (!$fence_request && $service_count && $haenv->quorate()) {
> if ($self->get_protected_ha_agent_lock()) {
> @@ -709,12 +721,17 @@ sub manage_resources {
> my $nodename = $haenv->nodename();
>
> my $ss = $self->{service_status};
> + my $deferred_sids;
Nit: Here, a full $disarm_deferring_sids would provide the most context.
> + $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename)
> + if $self->{mode} eq 'disarm';
>
> foreach my $sid (keys %{ $self->{restart_tries} }) {
> delete $self->{restart_tries}->{$sid} if !$ss->{$sid};
> }
>
> foreach my $sid (keys %$ss) {
> + next if $deferred_sids && !$deferred_sids->{$sid};
> +
> my $sd = $ss->{$sid};
> next if !$sd->{node} || !$sd->{uid};
> next if $sd->{node} ne $nodename;
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 9b901c4f..a2baf349 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -929,15 +929,13 @@ sub handle_disarm {
> }
>
> # defer disarm if any services are in a transient state that needs the state machine to resolve
> - my $deferred_sids = {};
> - for my $sid (sort keys %$ss) {
> + my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss);
> + for my $sid (sort keys %$deferred_sids) {
> my $state = $ss->{$sid}->{state};
> if ($state eq 'fence' || $state eq 'recovery') {
> $haenv->log('warning', "deferring disarm - service '$sid' is in '$state' state");
> - $deferred_sids->{$sid} = 1;
> - } elsif ($state eq 'migrate' || $state eq 'relocate') {
> + } else {
> $haenv->log('info', "deferring disarm - service '$sid' is in '$state' state");
> - $deferred_sids->{$sid} = 1;
> }
> }
>
> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
> index 26629fb5..37b27e11 100644
> --- a/src/PVE/HA/Tools.pm
> +++ b/src/PVE/HA/Tools.pm
> @@ -213,6 +213,23 @@ sub count_active_services {
> return $active_count;
> }
>
> +sub get_disarm_deferred_services {
> + my ($ss, $node) = @_;
> +
> + my $deferred_sids = {};
> + my @deferrable_states = qw(fence recovery migrate relocate);
Nit: disarm_deferring_states
> +
> + for my $sid (keys %$ss) {
> + my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
> +
> + next if $node && (!$current_node || $current_node ne $node);
Just wondering: when does !$current_node happen?
> +
> + $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
> + }
> +
> + return $deferred_sids;
> +}
> +
> sub get_verbose_service_state {
> my ($service_state, $service_conf) = @_;
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* applied: [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed
2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
` (2 preceding siblings ...)
2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
@ 2026-05-19 20:11 ` Thomas Lamprecht
3 siblings, 0 replies; 7+ messages in thread
From: Thomas Lamprecht @ 2026-05-19 20:11 UTC (permalink / raw)
To: pve-devel, Daniel Kral
On Tue, 19 May 2026 16:38:34 +0200, Daniel Kral wrote:
> As in patch message #2:
>
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
>
> [...]
Applied, thanks!
Squashed Fiona's naming nits (disarm_deferring) into 2/2, added a contract doc
on the new Tools helper, and dropped the unused $target_node destructuring and
!$current_node guard while at it.
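
For readers following along, here is a minimal stand-alone sketch of how the
helper might look with the renames applied. It is not the committed code: the
name get_disarm_deferring_services, the `// ''` defaults and the sample
service status hash are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical model of the Tools helper discussed in this thread; the
# names follow the suggested disarm_deferring naming and are assumptions.
my @disarm_deferring_states = qw(fence recovery migrate relocate);

sub get_disarm_deferring_services {
    my ($ss, $node) = @_;

    my $disarm_deferring_sids = {};
    for my $sid (keys %$ss) {
        my $sd = $ss->{$sid};
        my $state = $sd->{state} // '';

        # with a node filter, skip services that are not on that node
        next if defined($node) && ($sd->{node} // '') ne $node;

        $disarm_deferring_sids->{$sid} = 1
            if grep { $state eq $_ } @disarm_deferring_states;
    }

    return $disarm_deferring_sids;
}

# Made-up service status, loosely modelled on the test case in 1/2.
my $ss = {
    'vm:101' => { state => 'started', node => 'node1' },
    'vm:102' => { state => 'started', node => 'node3' },
    'vm:103' => { state => 'migrate', node => 'node3', target => 'node2' },
};

my $all = get_disarm_deferring_services($ss); # CRM view, no node filter
my $node3 = get_disarm_deferring_services($ss, 'node3'); # LRM view on node3
my $node1 = get_disarm_deferring_services($ss, 'node1'); # LRM view on node1

# A non-empty hash is true in boolean context, which is what the LRM-side
# check in the patch relies on when it returns %$deferred_sids.
print "node3 must become active\n" if %$node3;
print "node1 may stay idle\n" if !%$node1;
```

In this sketch only the LRM on node3 sees a disarm-deferring service, so only
that LRM would leave the idle state, and it would skip 'vm:102' while active.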
As I had looked into this a bit too, and seemingly went for a different angle,
I added a follow-up to also freeze non-deferred services on the deferred cycle.
Otherwise, any LRM restart in the meantime (apt upgrade, manual restart) hangs
in 'restart' mode, waiting for active_service_count to drop to zero, which
won't happen while the CRM is in deferred mode.
Bumped to 5.2.4.
[1/2] test: add disarm test cases for idle lrms with transient ha resources
commit: 54497ae43d5bdb3842eb6f32e0642da9deb7cf5b
[2/2] make idle LRMs resolve leftover moving HA resources while disarmed
commit: 9917559d01e60cc55622102bef2b8f5d87892d33
^ permalink raw reply [flat|nested] 7+ messages in thread