From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager 1/2] test: add disarm test cases for idle lrms with transient ha resources
Date: Tue, 19 May 2026 16:38:35 +0200
Message-ID: <20260519143842.382324-2-d.kral@proxmox.com>
In-Reply-To: <20260519143842.382324-1-d.kral@proxmox.com>

These test cases document how the HA stack currently behaves if it is
disarmed while there are HA resources in disarm-deferring states (fence,
recovery, migrate, relocate) and their responsible LRMs are already in
idle state, which makes them unresponsive to the HA Manager while they
are in disarm mode.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
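Note for reviewers (not part of the commit): the stuck loop described above
shows up in the log.expect files as the same 'deferring disarm' message
repeated every CRM round. Below is a minimal sketch for summarizing such a log.
It assumes the "<level> <time> <source>: <message>" line layout seen in the
expected logs of this patch. The helper name summarize_deferrals and both
regular expressions are hypothetical and not part of ha-manager.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log.expect files in this patch:
#   "<level> <time> <source>: <message>"
LOG_LINE = re.compile(r"^(\w+)\s+(\d+)\s+(\S+):\s+(.*)$")
# Message text as it appears in the expected logs.
DEFER = re.compile(r"deferring disarm - service '([^']+)' is in '([^']+)' state")


def summarize_deferrals(log_text):
    """Count 'deferring disarm' messages per (service, state) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        # Tolerate a leading '+' so lines copied from a diff also parse.
        m = LOG_LINE.match(line.lstrip("+"))
        if not m:
            continue
        d = DEFER.search(m.group(4))
        if d:
            counts[(d.group(1), d.group(2))] += 1
    return counts


sample = """\
warn     20    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
info     20    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
"""
summary = summarize_deferrals(sample)
```

A service whose count keeps growing with the simulated time, like 'vm:103' in
the sample, is one whose deferral is never resolved, whereas 'vm:102' is
deferred only once because the HA Manager handles the fence itself.
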
 src/test/test-disarm-idle-lrm1/README         |  8 +++
 src/test/test-disarm-idle-lrm1/cmdlist        |  3 +
 .../test-disarm-idle-lrm1/hardware_status     |  5 ++
 src/test/test-disarm-idle-lrm1/log.expect     | 59 +++++++++++++++++++
 src/test/test-disarm-idle-lrm1/manager_status | 26 ++++++++
 src/test/test-disarm-idle-lrm1/service_config |  5 ++
 src/test/test-disarm-idle-lrm2/README         |  8 +++
 src/test/test-disarm-idle-lrm2/cmdlist        |  3 +
 .../test-disarm-idle-lrm2/hardware_status     |  5 ++
 src/test/test-disarm-idle-lrm2/log.expect     | 56 ++++++++++++++++++
 src/test/test-disarm-idle-lrm2/manager_status | 26 ++++++++
 src/test/test-disarm-idle-lrm2/service_config |  5 ++
 12 files changed, 209 insertions(+)
 create mode 100644 src/test/test-disarm-idle-lrm1/README
 create mode 100644 src/test/test-disarm-idle-lrm1/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm1/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm1/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm1/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm1/service_config
 create mode 100644 src/test/test-disarm-idle-lrm2/README
 create mode 100644 src/test/test-disarm-idle-lrm2/cmdlist
 create mode 100644 src/test/test-disarm-idle-lrm2/hardware_status
 create mode 100644 src/test/test-disarm-idle-lrm2/log.expect
 create mode 100644 src/test/test-disarm-idle-lrm2/manager_status
 create mode 100644 src/test/test-disarm-idle-lrm2/service_config

diff --git a/src/test/test-disarm-idle-lrm1/README b/src/test/test-disarm-idle-lrm1/README
new file mode 100644
index 00000000..6d5124cd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'ignore' mode (keep HA resources in their previous
+state) while there is an HA resource in a transient state on a node whose LRM
+is idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. However, as the LRM for the moving HA resource is idle, the HA
+Manager never gets an LRM response for that HA resource, so the disarm stays
+deferred and the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm1/cmdlist b/src/test/test-disarm-idle-lrm1/cmdlist
new file mode 100644
index 00000000..a29cf8e3
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/cmdlist
@@ -0,0 +1,3 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha ignore" ]
+]
diff --git a/src/test/test-disarm-idle-lrm1/hardware_status b/src/test/test-disarm-idle-lrm1/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm1/log.expect b/src/test/test-disarm-idle-lrm1/log.expect
new file mode 100644
index 00000000..1b7f4ece
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/log.expect
@@ -0,0 +1,59 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute crm node1 disarm-ha ignore
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: got crm command: disarm-ha ignore
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:101'
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:102'
+info     20    node1/crm: disarm: suspending HA tracking for service 'vm:103'
+warn     20    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info     20    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     20    node1/crm: got lock 'ha_agent_node2_lock'
+info     20    node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai     20    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm1/manager_status b/src/test/test-disarm-idle-lrm1/manager_status
new file mode 100644
index 00000000..ba7448a0
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/manager_status
@@ -0,0 +1,26 @@
+{
+    "master_node": "node1",
+    "node_status": {
+	"node1":"online",
+	"node2":"fence",
+	"node3":"online"
+    },
+    "service_status": {
+	"vm:101": {
+	    "node": "node1",
+	    "state": "started",
+	    "uid": "lavE3c7vLnotUBGT9whswg"
+	},
+	"vm:102": {
+	    "node": "node2",
+	    "state": "fence",
+	    "uid": "lavE3c7vLnotUBGT9whswh"
+	},
+	"vm:103": {
+	    "node": "node3",
+	    "state": "migrate",
+	    "target": "node2",
+	    "uid": "lavE3c7vLnotUBGT9whswj"
+	}
+    }
+}
diff --git a/src/test/test-disarm-idle-lrm1/service_config b/src/test/test-disarm-idle-lrm1/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm1/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/README b/src/test/test-disarm-idle-lrm2/README
new file mode 100644
index 00000000..d1731578
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/README
@@ -0,0 +1,8 @@
+Disarm the HA stack in 'freeze' mode (keep HA resources frozen while disarmed)
+while there is an HA resource in a transient state on a node whose LRM is
+idle.
+
+Fenced HA resources are already handled by the HA Manager itself, so this works
+as expected. However, as the LRM for the moving HA resource is idle, the HA
+Manager never gets an LRM response for that HA resource, so the disarm stays
+deferred and the HA Manager is stuck in a loop.
diff --git a/src/test/test-disarm-idle-lrm2/cmdlist b/src/test/test-disarm-idle-lrm2/cmdlist
new file mode 100644
index 00000000..5a46a662
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/cmdlist
@@ -0,0 +1,3 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on", "crm node1 disarm-ha freeze" ]
+]
diff --git a/src/test/test-disarm-idle-lrm2/hardware_status b/src/test/test-disarm-idle-lrm2/hardware_status
new file mode 100644
index 00000000..451beb13
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-disarm-idle-lrm2/log.expect b/src/test/test-disarm-idle-lrm2/log.expect
new file mode 100644
index 00000000..d0ba96ff
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/log.expect
@@ -0,0 +1,56 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute crm node1 disarm-ha freeze
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: got crm command: disarm-ha freeze
+warn     20    node1/crm: deferring disarm - service 'vm:102' is in 'fence' state
+info     20    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     20    node1/crm: got lock 'ha_agent_node2_lock'
+info     20    node1/crm: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: node 'node2': state changed from 'fence' => 'unknown'
+emai     20    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node2'
+info     20    node1/crm: service 'vm:102': state changed from 'fence' to 'recovery'
+info     20    node1/crm: recover service 'vm:102' from fenced node 'node2' to node 'node1'
+info     20    node1/crm: service 'vm:102': state changed from 'recovery' to 'started'  (node = node1)
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     40    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     40    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     60    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info     80    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    100    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    120    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    140    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    160    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    180    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    200    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    220    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    240    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    260    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    280    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    300    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    320    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    340    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    360    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    380    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    400    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    420    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    440    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    460    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    480    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    500    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    520    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    540    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    560    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    580    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    600    node1/crm: deferring disarm - service 'vm:103' is in 'migrate' state
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-disarm-idle-lrm2/manager_status b/src/test/test-disarm-idle-lrm2/manager_status
new file mode 100644
index 00000000..c28f4ffd
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/manager_status
@@ -0,0 +1,26 @@
+{
+    "master_node": "node1",
+    "node_status": {
+	"node1":"online",
+	"node2":"fence",
+	"node3":"online"
+    },
+    "service_status": {
+	"vm:101": {
+	    "node": "node1",
+	    "state": "online",
+	    "uid": "lavE3c7vLnotUBGT9whswg"
+	},
+	"vm:102": {
+	    "node": "node2",
+	    "state": "fence",
+	    "uid": "lavE3c7vLnotUBGT9whswh"
+	},
+	"vm:103": {
+	    "node": "node3",
+	    "state": "migrate",
+	    "target": "node2",
+	    "uid": "lavE3c7vLnotUBGT9whswj"
+	}
+    }
+}
diff --git a/src/test/test-disarm-idle-lrm2/service_config b/src/test/test-disarm-idle-lrm2/service_config
new file mode 100644
index 00000000..4b26f6b4
--- /dev/null
+++ b/src/test/test-disarm-idle-lrm2/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node2", "state": "started" },
+    "vm:103": { "node": "node3", "state": "started" }
+}
-- 
2.47.3





Thread overview (6+ messages):
2026-05-19 14:38 [PATCH-SERIES ha-manager 0/2] make idle LRMs resolve leftover moving HA resources while disarmed Daniel Kral
2026-05-19 14:38 ` Daniel Kral [this message]
2026-05-19 14:38 ` [PATCH ha-manager 2/2] " Daniel Kral
2026-05-19 16:00   ` Fiona Ebner
2026-05-19 14:47 ` [PATCH-SERIES ha-manager 0/2] " Daniel Kral
2026-05-19 16:00   ` Fiona Ebner
