From: Fiona Ebner
To: pve-devel@lists.proxmox.com
Date: Thu, 17 Nov 2022 15:00:01 +0100
Message-Id: <20221117140018.105004-1-f.ebner@proxmox.com>
Subject: [pve-devel] [PATCH-SERIES v2 ha-manager/docs] add static usage scheduler for HA manager
List-Id: Proxmox VE development discussion
X-List-Received-Date: Thu, 17 Nov 2022 14:01:25 -0000

Right now, the online node usage calculation for the HA manager only
considers the number of active services on each node. This patch
series allows switching to a 'static' scheduler mode, which instead
uses static usage information from the nodes and guest configurations.

With this version, the effect is limited to choosing nodes during
recovery or for migrations triggered by a shutdown policy, but the
plan is to extend this in the future.

As a next step, it would be nice to also have it for startup, but
AFAICT the issue is that the node selection only happens after the
state is already set to started, and I think select_service_node()
doesn't currently know if a service has been newly started. I haven't
looked into it in too much detail though.

One idea for getting a balancer out of it is to:
1. (optionally) sort all services by badness (needs new backend
   function)
2. iterate scoring the nodes for each service, adding the usage to the
   chosen node after each iteration. The current node can be kept if
   its score doesn't differ too much from that of the best node.
3. record the chosen nodes and migrate the services accordingly.

The online node usage calculation is factored out into a 'Usage'
plugin system to ease adding the new static mode without much clutter.

If not all nodes provide static service information, we fall back to
the 'basic' mode. If only the scoring fails, the service count is used
as a fallback.

Dependency bumps needed:
proxmox-ha-manager (build)depends on proxmox-perl-rs

The new feature is only usable with updated pve-manager and
pve-cluster of course, but there is no hard dependency.

Changes from v1:
    * Drop already applied patches.
    * Add tests for HA manager which also required properly adding
      relevant methods to the simulation environment.
    * Implement fallback for scoring in Usage/Static.pm.
    * Improve documentation and mention current limitation with many
      services.
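
To make the balancer idea above (steps 1 to 3) a bit more concrete,
here is a minimal Python sketch. It is not part of this series and not
how the HA manager is implemented: Node, Service, score(),
plan_balance() and the tolerance parameter are hypothetical names, the
score is a simple stand-in (highest relative CPU/memory load after
placing the service, lower is better) rather than the scoring used via
proxmox-perl-rs, and sorting by memory, then CPU, merely stands in for
the 'badness' ordering that would need a new backend function.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    maxcpu: float      # static CPU capacity (cores)
    maxmem: float      # static memory capacity (GiB)
    cpu: float = 0.0   # accumulated static CPU usage
    mem: float = 0.0   # accumulated static memory usage


@dataclass(frozen=True)
class Service:
    sid: str
    cpu: float  # configured cores
    mem: float  # configured memory (GiB)


def score(node, svc):
    """Stand-in score (assumption): highest relative load on the node
    if svc were added to it. Lower is better."""
    return max((node.cpu + svc.cpu) / node.maxcpu,
               (node.mem + svc.mem) / node.maxmem)


def plan_balance(nodes, services, placement, tolerance=0.1):
    """Return {sid: (current_node, target_node)} for services to migrate.

    placement maps each sid to the node it currently runs on.
    """
    # Start from empty usage on fresh copies, so the inputs stay untouched.
    usage = {n.name: Node(n.name, n.maxcpu, n.maxmem) for n in nodes}

    # Step 1: sort by "badness". Assumption: biggest services first,
    # with the sid as a tie-breaker to keep the result deterministic.
    ordered = sorted(services, key=lambda s: (-s.mem, -s.cpu, s.sid))

    plan = {}
    for svc in ordered:
        # Step 2: score every node for this service.
        scores = {name: score(node, svc) for name, node in usage.items()}
        best = min(scores, key=scores.get)
        current = placement[svc.sid]
        # Keep the current node if it is not much worse than the best one.
        if scores[current] - scores[best] <= tolerance:
            chosen = current
        else:
            chosen = best
        # Add the usage to the chosen node before scoring the next service.
        usage[chosen].cpu += svc.cpu
        usage[chosen].mem += svc.mem
        plan[svc.sid] = chosen

    # Step 3: record only the services whose node actually changes.
    return {sid: (placement[sid], target)
            for sid, target in plan.items()
            if target != placement[sid]}


if __name__ == "__main__":
    nodes = [Node("node1", 8, 32), Node("node2", 8, 32), Node("node3", 8, 32)]
    services = [Service("vm:100", 4, 16), Service("vm:101", 2, 8),
                Service("vm:102", 2, 8)]
    placement = {s.sid: "node1" for s in services}
    print(plan_balance(nodes, services, placement))
```

With three equally sized nodes and all three services starting on
node1, the sketch keeps the largest service in place and moves the two
smaller ones to node2 and node3. A larger tolerance makes it more
reluctant to migrate; with tolerance=1.0 nothing moves in this example.
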

ha-manager:

Fiona Ebner (15):
  env: add get_static_node_stats() method
  resources: add get_static_stats() method
  add Usage base plugin and Usage::Basic plugin
  manager: select service node: add $sid to parameters
  manager: online node usage: switch to Usage::Basic plugin
  usage: add Usage::Static plugin
  env: rename get_ha_settings to get_datacenter_settings
  env: datacenter config: include crs (cluster-resource-scheduling) setting
  manager: set resource scheduler mode upon init
  manager: use static resource scheduler when configured
  manager: avoid scoring nodes if maintenance fallback node is valid
  manager: avoid scoring nodes when not trying next and current node is valid
  usage: static: use service count on nodes as a fallback
  test: add tests for static resource scheduling
  resources: add missing PVE::Cluster use statements

 debian/pve-ha-manager.install             |   3 +
 src/PVE/HA/Env.pm                         |  10 +-
 src/PVE/HA/Env/PVE2.pm                    |  27 ++-
 src/PVE/HA/LRM.pm                         |   4 +-
 src/PVE/HA/Makefile                       |   3 +-
 src/PVE/HA/Manager.pm                     |  79 +++++---
 src/PVE/HA/Resources.pm                   |   5 +
 src/PVE/HA/Resources/PVECT.pm             |  13 ++
 src/PVE/HA/Resources/PVEVM.pm             |  16 ++
 src/PVE/HA/Sim/Env.pm                     |  13 +-
 src/PVE/HA/Sim/Hardware.pm                |  28 +++
 src/PVE/HA/Sim/Resources.pm               |  10 +
 src/PVE/HA/Usage.pm                       |  50 +++++
 src/PVE/HA/Usage/Basic.pm                 |  52 ++++++
 src/PVE/HA/Usage/Makefile                 |   6 +
 src/PVE/HA/Usage/Static.pm                | 120 ++++++++++++
 src/test/test-crs-static1/README          |   4 +
 src/test/test-crs-static1/cmdlist         |   4 +
 src/test/test-crs-static1/datacenter.cfg  |   6 +
 src/test/test-crs-static1/hardware_status |   5 +
 src/test/test-crs-static1/log.expect      |  50 +++++
 src/test/test-crs-static1/manager_status  |   1 +
 src/test/test-crs-static1/service_config  |   3 +
 .../test-crs-static1/static_service_stats |   3 +
 src/test/test-crs-static2/README          |   4 +
 src/test/test-crs-static2/cmdlist         |  20 ++
 src/test/test-crs-static2/datacenter.cfg  |   6 +
 src/test/test-crs-static2/groups          |   2 +
 src/test/test-crs-static2/hardware_status |   7 +
 src/test/test-crs-static2/log.expect      | 171 ++++++++++++++++++
 src/test/test-crs-static2/manager_status  |   1 +
 src/test/test-crs-static2/service_config  |   3 +
 .../test-crs-static2/static_service_stats |   3 +
 src/test/test-crs-static3/README          |   5 +
 src/test/test-crs-static3/cmdlist         |   4 +
 src/test/test-crs-static3/datacenter.cfg  |   9 +
 src/test/test-crs-static3/hardware_status |   5 +
 src/test/test-crs-static3/log.expect      | 131 ++++++++++++++
 src/test/test-crs-static3/manager_status  |   1 +
 src/test/test-crs-static3/service_config  |  12 ++
 .../test-crs-static3/static_service_stats |  12 ++
 src/test/test-crs-static4/README          |   6 +
 src/test/test-crs-static4/cmdlist         |   4 +
 src/test/test-crs-static4/datacenter.cfg  |   9 +
 src/test/test-crs-static4/hardware_status |   5 +
 src/test/test-crs-static4/log.expect      | 149 +++++++++++++++
 src/test/test-crs-static4/manager_status  |   1 +
 src/test/test-crs-static4/service_config  |  12 ++
 .../test-crs-static4/static_service_stats |  12 ++
 src/test/test-crs-static5/README          |   5 +
 src/test/test-crs-static5/cmdlist         |   4 +
 src/test/test-crs-static5/datacenter.cfg  |   9 +
 src/test/test-crs-static5/hardware_status |   5 +
 src/test/test-crs-static5/log.expect      | 117 ++++++++++++
 src/test/test-crs-static5/manager_status  |   1 +
 src/test/test-crs-static5/service_config  |  10 +
 .../test-crs-static5/static_service_stats |  11 ++
 src/test/test_failover1.pl                |  21 ++-
 58 files changed, 1242 insertions(+), 50 deletions(-)
 create mode 100644 src/PVE/HA/Usage.pm
 create mode 100644 src/PVE/HA/Usage/Basic.pm
 create mode 100644 src/PVE/HA/Usage/Makefile
 create mode 100644 src/PVE/HA/Usage/Static.pm
 create mode 100644 src/test/test-crs-static1/README
 create mode 100644 src/test/test-crs-static1/cmdlist
 create mode 100644 src/test/test-crs-static1/datacenter.cfg
 create mode 100644 src/test/test-crs-static1/hardware_status
 create mode 100644 src/test/test-crs-static1/log.expect
 create mode 100644 src/test/test-crs-static1/manager_status
 create mode 100644 src/test/test-crs-static1/service_config
 create mode 100644 src/test/test-crs-static1/static_service_stats
 create mode 100644 src/test/test-crs-static2/README
 create mode 100644 src/test/test-crs-static2/cmdlist
 create mode 100644 src/test/test-crs-static2/datacenter.cfg
 create mode 100644 src/test/test-crs-static2/groups
 create mode 100644 src/test/test-crs-static2/hardware_status
 create mode 100644 src/test/test-crs-static2/log.expect
 create mode 100644 src/test/test-crs-static2/manager_status
 create mode 100644 src/test/test-crs-static2/service_config
 create mode 100644 src/test/test-crs-static2/static_service_stats
 create mode 100644 src/test/test-crs-static3/README
 create mode 100644 src/test/test-crs-static3/cmdlist
 create mode 100644 src/test/test-crs-static3/datacenter.cfg
 create mode 100644 src/test/test-crs-static3/hardware_status
 create mode 100644 src/test/test-crs-static3/log.expect
 create mode 100644 src/test/test-crs-static3/manager_status
 create mode 100644 src/test/test-crs-static3/service_config
 create mode 100644 src/test/test-crs-static3/static_service_stats
 create mode 100644 src/test/test-crs-static4/README
 create mode 100644 src/test/test-crs-static4/cmdlist
 create mode 100644 src/test/test-crs-static4/datacenter.cfg
 create mode 100644 src/test/test-crs-static4/hardware_status
 create mode 100644 src/test/test-crs-static4/log.expect
 create mode 100644 src/test/test-crs-static4/manager_status
 create mode 100644 src/test/test-crs-static4/service_config
 create mode 100644 src/test/test-crs-static4/static_service_stats
 create mode 100644 src/test/test-crs-static5/README
 create mode 100644 src/test/test-crs-static5/cmdlist
 create mode 100644 src/test/test-crs-static5/datacenter.cfg
 create mode 100644 src/test/test-crs-static5/hardware_status
 create mode 100644 src/test/test-crs-static5/log.expect
 create mode 100644 src/test/test-crs-static5/manager_status
 create mode 100644 src/test/test-crs-static5/service_config
 create mode 100644 src/test/test-crs-static5/static_service_stats

docs:

Fiona Ebner (2):
  ha: add section about scheduler modes
  ha: add warning against using 'static' mode with many services

 ha-manager.adoc | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

-- 
2.30.2