From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH ha-manager v2 21/26] manager: handle negative colocations with too many services
Date: Fri, 20 Jun 2025 16:31:33 +0200
Message-ID: <20250620143148.218469-26-d.kral@proxmox.com>
In-Reply-To: <20250620143148.218469-1-d.kral@proxmox.com>

select_service_node(...) in 'none' mode usually returns no node only if
negative colocation rules specify more services than there are nodes
available. In that case the services cannot be separated, as there are no
nodes left to put them on, so they are moved to the error state for now.
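
For illustration, the test case added below exercises exactly this situation:
a strict negative colocation rule naming five services on a three-node
cluster (shown here in the rules_config format of the new test fixture). Only
three of the five services can ever run separated on three nodes, and it is
not clear which of the three left behind should keep their current node:

    colocation: lonely-must-too-many-vms-be
        services vm:101,vm:102,vm:103,vm:104,vm:105
        affinity separate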

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
This is not ideal and I'd rather have such rules dropped in the
check_feasibility(...) part, but then we'd need to introduce more state in
the check helpers or make a direct call to
PVE::Cluster::get_nodelist(...).
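
As a rough sketch, such a dynamic check could compare each negative
colocation rule's service count against the cluster's node count. The helper
name and the rule-hash layout below are assumptions for illustration, not
part of this patch:

    # Hypothetical: drop strict negative colocation rules which can never be
    # satisfied because they name more services than the cluster has nodes.
    sub drop_unseparable_colocation_rules {
        my ($rules) = @_;

        my $nodelist = PVE::Cluster::get_nodelist();

        for my $ruleid (sort keys %$rules) {
            my $rule = $rules->{$ruleid};
            next if $rule->{affinity} ne 'separate';

            my $servicecount = scalar(keys %{ $rule->{services} });

            # pigeonhole: more services than nodes cannot all be separated
            delete $rules->{$ruleid} if $servicecount > scalar(@$nodelist);
        }
    }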

changes since v1:
    - NEW!

 src/PVE/HA/Manager.pm                         | 13 +++++
 .../test-colocation-strict-separate9/README   | 14 +++++
 .../test-colocation-strict-separate9/cmdlist  |  3 +
 .../hardware_status                           |  5 ++
 .../log.expect                                | 57 +++++++++++++++++++
 .../manager_status                            |  1 +
 .../rules_config                              |  3 +
 .../service_config                            |  7 +++
 8 files changed, 103 insertions(+)
 create mode 100644 src/test/test-colocation-strict-separate9/README
 create mode 100644 src/test/test-colocation-strict-separate9/cmdlist
 create mode 100644 src/test/test-colocation-strict-separate9/hardware_status
 create mode 100644 src/test/test-colocation-strict-separate9/log.expect
 create mode 100644 src/test/test-colocation-strict-separate9/manager_status
 create mode 100644 src/test/test-colocation-strict-separate9/rules_config
 create mode 100644 src/test/test-colocation-strict-separate9/service_config

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 66e5710..59b2998 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1092,6 +1092,19 @@ sub next_state_started {
                         );
                         delete $sd->{maintenance_node};
                     }
+                } elsif ($select_mode eq 'none' && !defined($node)) {
+                    # Having no node here means that the service is started but cannot find any
+                    # node it is allowed to run on, e.g. a negative colocation rule was added while
+                    # the services aren't separated yet.
+                    # TODO Could be made impossible by a dynamic check to drop negative colocation
+                    #      rules which have defined more services than available nodes
+                    $haenv->log(
+                        'err',
+                        "service '$sid' cannot run on '$sd->{node}', but no recovery node found",
+                    );
+
+                    # TODO Should this really move the service to the error state?
+                    $change_service_state->($self, $sid, 'error');
                 }
 
                 # ensure service get started again if it went unexpected down
diff --git a/src/test/test-colocation-strict-separate9/README b/src/test/test-colocation-strict-separate9/README
new file mode 100644
index 0000000..85494dd
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/README
@@ -0,0 +1,14 @@
+Test whether a strict negative colocation rule among five services on a
+three-node cluster puts the services left on the same node into the error
+state, as there are not enough nodes to separate all of them and it is also
+not clear which of the three is more important to run.
+
+The test scenario is:
+- vm:101 through vm:105 must be kept separate
+- vm:101 through vm:105 are all running on node1
+
+The expected outcome is:
+- As the cluster comes up, vm:101 and vm:102 are migrated to node2 and node3
+- vm:103, vm:104, and vm:105 are put in error state, as there are not enough
+  nodes left to separate them and it is also not clear which of them is more
+  important to run on the only node left.
diff --git a/src/test/test-colocation-strict-separate9/cmdlist b/src/test/test-colocation-strict-separate9/cmdlist
new file mode 100644
index 0000000..3bfad44
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/cmdlist
@@ -0,0 +1,3 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"]
+]
diff --git a/src/test/test-colocation-strict-separate9/hardware_status b/src/test/test-colocation-strict-separate9/hardware_status
new file mode 100644
index 0000000..451beb1
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/hardware_status
@@ -0,0 +1,5 @@
+{
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
diff --git a/src/test/test-colocation-strict-separate9/log.expect b/src/test/test-colocation-strict-separate9/log.expect
new file mode 100644
index 0000000..efe85a2
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/log.expect
@@ -0,0 +1,57 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node1'
+info     20    node1/crm: adding new service 'vm:103' on node 'node1'
+info     20    node1/crm: adding new service 'vm:104' on node 'node1'
+info     20    node1/crm: adding new service 'vm:105' on node 'node1'
+info     20    node1/crm: service 'vm:101': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:102': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:103': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:104': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: service 'vm:105': state changed from 'request_start' to 'started'  (node = node1)
+info     20    node1/crm: migrate service 'vm:101' to node 'node2' (running)
+info     20    node1/crm: service 'vm:101': state changed from 'started' to 'migrate'  (node = node1, target = node2)
+info     20    node1/crm: migrate service 'vm:102' to node 'node3' (running)
+info     20    node1/crm: service 'vm:102': state changed from 'started' to 'migrate'  (node = node1, target = node3)
+err      20    node1/crm: service 'vm:103' cannot run on 'node1', but no recovery node found
+info     20    node1/crm: service 'vm:103': state changed from 'started' to 'error'
+err      20    node1/crm: service 'vm:104' cannot run on 'node1', but no recovery node found
+info     20    node1/crm: service 'vm:104': state changed from 'started' to 'error'
+err      20    node1/crm: service 'vm:105' cannot run on 'node1', but no recovery node found
+info     20    node1/crm: service 'vm:105': state changed from 'started' to 'error'
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: service vm:101 - start migrate to node 'node2'
+info     21    node1/lrm: service vm:101 - end migrate to node 'node2'
+info     21    node1/lrm: service vm:102 - start migrate to node 'node3'
+info     21    node1/lrm: service vm:102 - end migrate to node 'node3'
+err      21    node1/lrm: service vm:103 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
+err      21    node1/lrm: service vm:104 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
+err      21    node1/lrm: service vm:105 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
+info     22    node2/crm: status change wait_for_quorum => slave
+info     24    node3/crm: status change wait_for_quorum => slave
+info     40    node1/crm: service 'vm:101': state changed from 'migrate' to 'started'  (node = node2)
+info     40    node1/crm: service 'vm:102': state changed from 'migrate' to 'started'  (node = node3)
+info     43    node2/lrm: got lock 'ha_agent_node2_lock'
+info     43    node2/lrm: status change wait_for_agent_lock => active
+info     43    node2/lrm: starting service vm:101
+info     43    node2/lrm: service status vm:101 started
+info     45    node3/lrm: got lock 'ha_agent_node3_lock'
+info     45    node3/lrm: status change wait_for_agent_lock => active
+info     45    node3/lrm: starting service vm:102
+info     45    node3/lrm: service status vm:102 started
+info    620     hardware: exit simulation - done
diff --git a/src/test/test-colocation-strict-separate9/manager_status b/src/test/test-colocation-strict-separate9/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-colocation-strict-separate9/rules_config b/src/test/test-colocation-strict-separate9/rules_config
new file mode 100644
index 0000000..478d70b
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/rules_config
@@ -0,0 +1,3 @@
+colocation: lonely-must-too-many-vms-be
+	services vm:101,vm:102,vm:103,vm:104,vm:105
+	affinity separate
diff --git a/src/test/test-colocation-strict-separate9/service_config b/src/test/test-colocation-strict-separate9/service_config
new file mode 100644
index 0000000..a1d61f5
--- /dev/null
+++ b/src/test/test-colocation-strict-separate9/service_config
@@ -0,0 +1,7 @@
+{
+    "vm:101": { "node": "node1", "state": "started" },
+    "vm:102": { "node": "node1", "state": "started" },
+    "vm:103": { "node": "node1", "state": "started" },
+    "vm:104": { "node": "node1", "state": "started" },
+    "vm:105": { "node": "node1", "state": "started" }
+}
-- 
2.39.5


