Date: Mon, 04 Aug 2025 17:24:33 +0200
From: "Hannes Duerr"
To: "Proxmox VE development discussion"
Cc: "pve-devel"
Subject: Re: [pve-devel] [PATCH docs 5/5] ha: replace in-text references to ha groups with ha rules

On Mon Aug 4, 2025 at 4:11 PM CEST, Daniel Kral wrote:
> As HA groups are replaced by HA node affinity rules and user can
> implement new CRS behavior with HA resource affinity rules now, update
> texts that reference HA groups with references to HA rules instead.
>
> While at it, also replace references to "HA services" with "HA
> resources" for short sections that are touched in the process as new
> references should use the latter term only.
>
> Signed-off-by: Daniel Kral
> ---
>  ha-manager.adoc | 49 +++++++++++++++++++++++++------------------------
>  1 file changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/ha-manager.adoc b/ha-manager.adoc
> index ffab83c..f63fd05 100644
> --- a/ha-manager.adoc
> +++ b/ha-manager.adoc
> @@ -314,9 +314,8 @@ recovery state.
>  recovery::
>
>  Wait for recovery of the service. The HA manager tries to find a new node where

forgot to change the `service` here?

> -the service can run on. This search depends not only on the list of online and
> -quorate nodes, but also if the service is a group member and how such a group
> -is limited.
> +the service can run on. This search depends on the list of online and quorate

s/service/resource/

> +nodes as well as the affinity rules the service is part of, if any.
s/service/resource/

>  As soon as a new available node is found, the service will be moved there and

forgot to change the `service` here?

>  initially placed into stopped state. If it's configured to run the new node
>  will do so.
> @@ -977,20 +976,24 @@ Recover Fenced Services
>  ~~~~~~~~~~~~~~~~~~~~~~~
>
>  After a node failed and its fencing was successful, the CRM tries to
> -move services from the failed node to nodes which are still online.
> +move HA resources from the failed node to nodes which are still online.
>
> -The selection of nodes, on which those services gets recovered, is
> -influenced by the resource `group` settings, the list of currently active
> -nodes, and their respective active service count.
> +The selection of the recovery nodes is influenced by the list of
> +currently active nodes, their respective loads depending on the used
> +scheduler, and the affinity rules the resource is part of, if any.
>
> -The CRM first builds a set out of the intersection between user selected
> -nodes (from `group` setting) and available nodes. It then choose the
> -subset of nodes with the highest priority, and finally select the node
> -with the lowest active service count. This minimizes the possibility
> +First, the CRM builds a set of nodes available to the HA resource. If the
> +resource is part of a node affinity rule, the set is reduced to the
> +highest priority nodes in the node affinity rule. If the resource is part
> +of a resource affinity rule, the set is further reduced to fufill their
> +constraints, which is either keeping the HA resource on the same node as
> +some other HA resources or keeping the HA resource on a different node
> +than some other HA resources. Finally, the CRM selects the node with the
> +lowest load according to the used scheduler to minimize the possibility
>  of an overloaded node.
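
Just to check that I read the new selection order correctly - roughly like
this? (made-up Python sketch, all names and data structures are invented,
not the actual HA manager code)

    from dataclasses import dataclass, field

    @dataclass
    class NodeAffinityRule:
        nodes: dict          # node name -> priority, higher wins (made-up model)

    @dataclass
    class ResourceAffinityRule:
        positive: bool       # True: keep together, False: keep separate
        other_nodes: set = field(default_factory=set)  # nodes the other members run on

    def select_recovery_node(online_nodes, loads, node_rule=None, resource_rules=()):
        # 0) start from all online and quorate nodes
        candidates = set(online_nodes)

        # 1) node affinity: reduce to the highest-priority nodes of the rule
        #    (assuming it falls back to all online nodes if none of the rule
        #    nodes is available, like non-restricted groups did)
        if node_rule:
            in_rule = candidates & set(node_rule.nodes)
            if in_rule:
                top = max(node_rule.nodes[n] for n in in_rule)
                candidates = {n for n in in_rule if node_rule.nodes[n] == top}

        # 2) resource affinity: stay with / away from the other rule members
        for rule in resource_rules:
            if rule.positive:
                candidates &= rule.other_nodes
            else:
                candidates -= rule.other_nodes

        # 3) pick the least loaded node according to the used scheduler
        return min(candidates, key=lambda n: loads.get(n, 0), default=None)

If that is the intended order, the paragraph above captures it well.
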
>
> -CAUTION: On node failure, the CRM distributes services to the
> -remaining nodes. This increases the service count on those nodes, and
> +CAUTION: On node failure, the CRM distributes resources to the
> +remaining nodes. This increases the resource count on those nodes, and
>  can lead to high load, especially on small clusters. Please design
>  your cluster so that it can handle such worst case scenarios.
>
> @@ -1102,7 +1105,7 @@ You can use the manual maintenance mode to mark the node as unavailable for HA
>  operation, prompting all services managed by HA to migrate to other nodes.

forgot to change the `service` here?

>
>  The target nodes for these migrations are selected from the other currently
> -available nodes, and determined by the HA group configuration and the configured
> +available nodes, and determined by the HA rules configuration and the configured
>  cluster resource scheduler (CRS) mode.
>  During each migration, the original node will be recorded in the HA managers'
>  state, so that the service can be moved back again automatically once the

forgot to change the `service` here?

> @@ -1173,14 +1176,12 @@ This triggers a migration of all HA Services currently located on this node.

forgot to change the `service` here?

>  The LRM will try to delay the shutdown process, until all running services get

forgot to change the `service` here?

>  moved away. But, this expects that the running services *can* be migrated to

forgot to change the `service` here?

>  another node. In other words, the service must not be locally bound, for example

forgot to change the `service` here?

> -by using hardware passthrough. As non-group member nodes are considered as
> -runnable target if no group member is available, this policy can still be used
> -when making use of HA groups with only some nodes selected. But, marking a group
> -as 'restricted' tells the HA manager that the service cannot run outside of the
> -chosen set of nodes. If all of those nodes are unavailable, the shutdown will
> -hang until you manually intervene. Once the shut down node comes back online
> -again, the previously displaced services will be moved back, if they were not
> -already manually migrated in-between.
> +by using hardware passthrough. For example, strict node affinity rules tell the

s/For example, s/S/

> +HA Manager that the service cannot run outside of the chosen set of nodes. If all
> +of those nodes are unavailable, the shutdown will hang until you manually

s/those/these/

> +intervene. Once the shut down node comes back online again, the previously
> +displaced services will be moved back, if they were not already manually migrated
> +in-between.
>
>  NOTE: The watchdog is still active during the migration process on shutdown.
>  If the node loses quorum it will be fenced and the services will be recovered.
> @@ -1266,8 +1267,8 @@ The change will be in effect starting with the next manager round (after a few
>  seconds).
>
>  For each service that needs to be recovered or migrated, the scheduler
> -iteratively chooses the best node among the nodes with the highest priority in
> -the service's group.
> +iteratively chooses the best node among the nodes that are available to
> +the service according to their HA rules, if any.

Doesn't the scheduler take the ha node affinity priority into consideration here?
And: s/service/resource/

>
>  NOTE: There are plans to add modes for (static and dynamic) load-balancing in
>  the future.
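
To make my question above a bit more concrete: with groups, the scheduler
only looked at the highest-priority group nodes and then balanced them
according to the configured CRS mode. I would naively expect node affinity
rules to behave the same, along the lines of this made-up Python model (not
the actual CRS/manager code, names are invented):

    def schedule(resources, online_nodes, node_affinity, loads):
        """Iteratively place resources on the least loaded allowed node.

        node_affinity: resource -> {node: priority}; empty dict = no rule.
        loads: node -> current resource count, as with the basic scheduler.
        """
        placement = {}
        for res in resources:
            allowed = set(online_nodes)
            prios = node_affinity.get(res, {})
            in_rule = allowed & set(prios)
            if in_rule:
                top = max(prios[n] for n in in_rule)
                # this is the part I am asking about: is the candidate set
                # still reduced to the highest-priority nodes, as with groups?
                allowed = {n for n in in_rule if prios[n] == top}
            if not allowed:
                continue  # nothing suitable, the resource stays where it is
            node = min(allowed, key=lambda n: loads.get(n, 0))
            placement[res] = node
            loads[node] = loads.get(node, 0) + 1  # counts towards the next pick
        return placement

If it does, it might be worth mentioning the priorities explicitly in this
paragraph as well.
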