From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH v2 ha-manager 1/2] tests: simulate stale maintenance node caused by simultaneous cluster shutdown
Date: Mon, 12 Jun 2023 17:27:10 +0200
Message-ID: <20230612152711.97078-1-f.ebner@proxmox.com>
As can be seen in the test log, the service is unexpectedly migrated
back. This is caused by the service's maintenance node property being
set by the initial shutdown, but never cleared, because clearing
currently happens only when the service is started on a different
node. The next commit will address the issue.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Rebase on current master.
* Split out introducing the test into a dedicated patch.
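For reviewers, here is a minimal toy sketch of the lifecycle that goes
stale. This is not the actual ha-manager Perl code; the field and event
names are illustrative only. It shows why a simultaneous shutdown of all
nodes leaves the maintenance node property set: the property is cleared
only when the service comes up on a *different* node, which never
happens when the whole cluster powers back on and the service restarts
on its original node.

```python
def next_state(service, event):
    """Toy transition for the maintenance-node property (names illustrative)."""
    if event == "shutdown_migrate":
        # shutdown policy 'migrate': remember which node entered maintenance
        service["maintenance_node"] = service["node"]
    elif event == "started":
        if service.get("maintenance_node") == service["node"]:
            # service restarted on the same node: property is NOT cleared,
            # so it goes stale and later triggers a migrate-back
            pass
        else:
            # started on a different node: property is cleared as intended
            service.pop("maintenance_node", None)
    return service

svc = {"node": "node1"}
svc = next_state(svc, "shutdown_migrate")  # whole cluster shuts down
svc = next_state(svc, "started")           # cluster boots, svc restarts on node1
print(svc)  # maintenance_node is still 'node1': stale
```

Under this sketch, a later `service vm:103 migrate node3` is followed by
the manager moving the service back to node1, matching the
"moving service 'vm:103' back to 'node1'" line in log.expect.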
src/test/test-stale-maintenance-node/cmdlist | 6 ++
.../datacenter.cfg | 5 +
.../hardware_status | 5 +
.../test-stale-maintenance-node/log.expect | 92 +++++++++++++++++++
.../manager_status | 1 +
.../service_config | 3 +
6 files changed, 112 insertions(+)
create mode 100644 src/test/test-stale-maintenance-node/cmdlist
create mode 100644 src/test/test-stale-maintenance-node/datacenter.cfg
create mode 100644 src/test/test-stale-maintenance-node/hardware_status
create mode 100644 src/test/test-stale-maintenance-node/log.expect
create mode 100644 src/test/test-stale-maintenance-node/manager_status
create mode 100644 src/test/test-stale-maintenance-node/service_config
diff --git a/src/test/test-stale-maintenance-node/cmdlist b/src/test/test-stale-maintenance-node/cmdlist
new file mode 100644
index 0000000..8e4ed64
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/cmdlist
@@ -0,0 +1,6 @@
+[
+ [ "power node1 on", "power node2 on", "power node3 on"],
+ [ "shutdown node1", "shutdown node2", "shutdown node3"],
+ [ "power node1 on", "power node2 on", "power node3 on"],
+ [ "service vm:103 migrate node3" ]
+]
diff --git a/src/test/test-stale-maintenance-node/datacenter.cfg b/src/test/test-stale-maintenance-node/datacenter.cfg
new file mode 100644
index 0000000..de0bf81
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/datacenter.cfg
@@ -0,0 +1,5 @@
+{
+ "ha": {
+ "shutdown_policy": "migrate"
+ }
+}
diff --git a/src/test/test-stale-maintenance-node/hardware_status b/src/test/test-stale-maintenance-node/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/hardware_status
@@ -0,0 +1,5 @@
+{
+ "node1": { "power": "off", "network": "off" },
+ "node2": { "power": "off", "network": "off" },
+ "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-stale-maintenance-node/log.expect b/src/test/test-stale-maintenance-node/log.expect
new file mode 100644
index 0000000..ac944ca
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/log.expect
@@ -0,0 +1,92 @@
+info 0 hardware: starting simulation
+info 20 cmdlist: execute power node1 on
+info 20 node1/crm: status change startup => wait_for_quorum
+info 20 node1/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node2 on
+info 20 node2/crm: status change startup => wait_for_quorum
+info 20 node2/lrm: status change startup => wait_for_agent_lock
+info 20 cmdlist: execute power node3 on
+info 20 node3/crm: status change startup => wait_for_quorum
+info 20 node3/lrm: status change startup => wait_for_agent_lock
+info 20 node1/crm: got lock 'ha_manager_lock'
+info 20 node1/crm: status change wait_for_quorum => master
+info 20 node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info 20 node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info 20 node1/crm: adding new service 'vm:103' on node 'node1'
+info 20 node1/crm: service 'vm:103': state changed from 'request_start' to 'started' (node = node1)
+info 21 node1/lrm: got lock 'ha_agent_node1_lock'
+info 21 node1/lrm: status change wait_for_agent_lock => active
+info 21 node1/lrm: starting service vm:103
+info 21 node1/lrm: service status vm:103 started
+info 22 node2/crm: status change wait_for_quorum => slave
+info 24 node3/crm: status change wait_for_quorum => slave
+info 120 cmdlist: execute shutdown node1
+info 120 node1/lrm: got shutdown request with shutdown policy 'migrate'
+info 120 node1/lrm: shutdown LRM, doing maintenance, removing this node from active list
+info 120 cmdlist: execute shutdown node2
+info 120 node2/lrm: got shutdown request with shutdown policy 'migrate'
+info 120 node2/lrm: shutdown LRM, doing maintenance, removing this node from active list
+info 120 cmdlist: execute shutdown node3
+info 120 node3/lrm: got shutdown request with shutdown policy 'migrate'
+info 120 node3/lrm: shutdown LRM, doing maintenance, removing this node from active list
+info 120 node1/crm: node 'node1': state changed from 'online' => 'maintenance'
+info 120 node1/crm: node 'node2': state changed from 'online' => 'maintenance'
+info 120 node1/crm: node 'node3': state changed from 'online' => 'maintenance'
+info 121 node1/lrm: status change active => maintenance
+info 124 node2/lrm: exit (loop end)
+info 124 shutdown: execute crm node2 stop
+info 123 node2/crm: server received shutdown request
+info 126 node3/lrm: exit (loop end)
+info 126 shutdown: execute crm node3 stop
+info 125 node3/crm: server received shutdown request
+info 143 node2/crm: exit (loop end)
+info 143 shutdown: execute power node2 off
+info 144 node3/crm: exit (loop end)
+info 144 shutdown: execute power node3 off
+info 160 node1/crm: status change master => lost_manager_lock
+info 160 node1/crm: status change lost_manager_lock => wait_for_quorum
+info 161 node1/lrm: status change maintenance => lost_agent_lock
+err 161 node1/lrm: get shutdown request in state 'lost_agent_lock' - detected 1 running services
+err 181 node1/lrm: get shutdown request in state 'lost_agent_lock' - detected 1 running services
+err 201 node1/lrm: get shutdown request in state 'lost_agent_lock' - detected 1 running services
+info 202 watchdog: execute power node1 off
+info 201 node1/crm: killed by poweroff
+info 202 node1/lrm: killed by poweroff
+info 202 hardware: server 'node1' stopped by poweroff (watchdog)
+info 220 cmdlist: execute power node1 on
+info 220 node1/crm: status change startup => wait_for_quorum
+info 220 node1/lrm: status change startup => wait_for_agent_lock
+info 220 cmdlist: execute power node2 on
+info 220 node2/crm: status change startup => wait_for_quorum
+info 220 node2/lrm: status change startup => wait_for_agent_lock
+info 220 cmdlist: execute power node3 on
+info 220 node3/crm: status change startup => wait_for_quorum
+info 220 node3/lrm: status change startup => wait_for_agent_lock
+info 220 node1/crm: status change wait_for_quorum => master
+info 221 node1/lrm: status change wait_for_agent_lock => active
+info 221 node1/lrm: starting service vm:103
+info 221 node1/lrm: service status vm:103 started
+info 222 node2/crm: status change wait_for_quorum => slave
+info 224 node3/crm: status change wait_for_quorum => slave
+info 240 node1/crm: node 'node1': state changed from 'maintenance' => 'online'
+info 240 node1/crm: node 'node2': state changed from 'maintenance' => 'online'
+info 240 node1/crm: node 'node3': state changed from 'maintenance' => 'online'
+info 320 cmdlist: execute service vm:103 migrate node3
+info 320 node1/crm: got crm command: migrate vm:103 node3
+info 320 node1/crm: migrate service 'vm:103' to node 'node3'
+info 320 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node1, target = node3)
+info 321 node1/lrm: service vm:103 - start migrate to node 'node3'
+info 321 node1/lrm: service vm:103 - end migrate to node 'node3'
+info 340 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node3)
+info 340 node1/crm: moving service 'vm:103' back to 'node1', node came back from maintenance.
+info 340 node1/crm: migrate service 'vm:103' to node 'node1' (running)
+info 340 node1/crm: service 'vm:103': state changed from 'started' to 'migrate' (node = node3, target = node1)
+info 345 node3/lrm: got lock 'ha_agent_node3_lock'
+info 345 node3/lrm: status change wait_for_agent_lock => active
+info 345 node3/lrm: service vm:103 - start migrate to node 'node1'
+info 345 node3/lrm: service vm:103 - end migrate to node 'node1'
+info 360 node1/crm: service 'vm:103': state changed from 'migrate' to 'started' (node = node1)
+info 361 node1/lrm: starting service vm:103
+info 361 node1/lrm: service status vm:103 started
+info 920 hardware: exit simulation - done
diff --git a/src/test/test-stale-maintenance-node/manager_status b/src/test/test-stale-maintenance-node/manager_status
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/manager_status
@@ -0,0 +1 @@
+{}
diff --git a/src/test/test-stale-maintenance-node/service_config b/src/test/test-stale-maintenance-node/service_config
new file mode 100644
index 0000000..cfed86f
--- /dev/null
+++ b/src/test/test-stale-maintenance-node/service_config
@@ -0,0 +1,3 @@
+{
+ "vm:103": { "node": "node1", "state": "enabled" }
+}
--
2.39.2
Thread overview (4 messages):
2023-06-12 15:27 [pve-devel] [PATCH v2 ha-manager 1/2] tests: simulate stale maintenance node caused by simultaneous cluster shutdown - Fiona Ebner [this message]
2023-06-12 15:27 [pve-devel] [PATCH v2 ha-manager 2/2] manager: clear stale maintenance node - Fiona Ebner
2023-06-13 7:00 [pve-devel] applied: [PATCH v2 ha-manager 2/2] manager: clear stale maintenance node - Thomas Lamprecht
2023-06-13 7:00 [pve-devel] applied: [PATCH v2 ha-manager 1/2] tests: simulate stale maintenance node caused by simultaneous cluster shutdown - Thomas Lamprecht