From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 27 Mar 2026 00:15:43 +0100
Subject: Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and
 arm-ha for safe cluster maintenance
To: Daniel Kral, pve-devel@lists.proxmox.com
From: Thomas Lamprecht
References: <20260321234350.2158438-1-t.lamprecht@proxmox.com>
 <20260321234350.2158438-4-t.lamprecht@proxmox.com>
List-Id: Proxmox VE development discussion
On 26.03.26 at 17:02, Daniel Kral wrote:
> Nice work!
>
> While rebasing the load balancing patches on this, I had a few thoughts
> about the transient states, which I wanted to add below. Nothing really
> urgent, and I might have missed something, so correct me if I'm wrong.
>
>
> One small UX nit:
>
> As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
> command takes priority over the migrate/relocate command. It is more of
> an edge case anyway that both a migration and a disarm-ha command are
> issued in the same HA Manager cycle, but of course it could still happen.
>
> Handling the disarming as an admin task is more important than a
> user-initiated migration request. It would still be great if this
> action were relayed back to the user for the migration request (where
> the user might only see the hamigrate task but no following qmigrate
> task), but this is more of a UX thing and should be handled e.g. as
> part of #6220 [0] so users can directly follow the HA request and the
> further actions.

Yeah, the UX could be slightly better, but as you hint, it's really a
pre-existing issue for a few cases already, so slightly orthogonal to this.
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220
>
> On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index b1dbe6a..aa29858 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>>      }
>>  }
>>
>> +sub handle_disarm {
>> +    my ($self, $disarm, $ss, $lrm_modes) = @_;
>> +
>> +    my $haenv = $self->{haenv};
>> +    my $ns = $self->{ns};
>> +
>> +    # defer disarm if any services are in a transient state that needs the state machine to resolve
>> +    for my $sid (sort keys %$ss) {
>> +        my $state = $ss->{$sid}->{state};
>> +        if ($state eq 'fence' || $state eq 'recovery') {
>> +            $haenv->log(
>> +                'warn', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so fence/recovery can progress
>> +        }
>> +        if ($state eq 'migrate' || $state eq 'relocate') {
>> +            $haenv->log(
>> +                'info', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so migration can complete
>> +        }
>> +    }
>
> Here the HA disarming is deferred so that the HA Manager can continue
> processing the HA resource FSM if at least one of the HA resources is
> in one of the 4 transient states.
>
> This can have side effects for other HA resources as well, which
> currently aren't in one of these 4 transient states, but are implicitly
> in a transient state.
>
>
> For example:
>
> There are two HA resources in the current service state.
>
> {
>     ...
>     "service_status": {
>         "vm:101": {
>             "state": "started",
>             "node": "node1",
>             "cmd": [ "migrate", "node2" ],
>             "uid": "...",
>         },
>         "vm:102": {
>             "state": "...",
>             "node": "node1",
>             "uid": "...",
>         },
>     }
> }
>
>
> If vm:102 is in state 'started', then the HA Manager will start to
> disarm the HA stack right away and drop vm:101's migration request in
> both disarm modes.
>
> If vm:102 is in any of the 4 transient states from above though, then
> the HA Manager will defer the disarm and process all HA resources,
> which will cause vm:101 to be put in 'migrate' now.
>
>
> So the transient state of one HA resource might cause other HA
> resources to be put in a transient state as well, even though I would
> have expected the deferring here to only resolve the existing transient
> states.
>
> This also means that the HA Manager will handle user requests for HA
> resources in the same HA Manager cycle as a disarm request if there is
> at least one HA resource in one of the 4 transient states. This
> contradicts that disarming takes priority over the migrate/relocate
> request.

good find!

> Two other possible cases of 'implicitly transient' states might be:
>
> - adding a HA rule, which makes a HA resource in 'started' state be
>   put in 'migrate' to another node when processing the
>   select_service_node() in next_state_started().
>
> - the node of a HA resource is offline delayed in the same round as
>   the disarm request. If none of the HA resources are in a transient
>   state yet, the disarm request goes through; otherwise the affected
>   HA resources might be put in 'fence'.
>
>
> I haven't thought this through fully, but an option might be that we
> only allow the FSM processing of the HA resources which are in one of
> these 4 transient states, and don't process the others.
>
> E.g. breaking out the FSM transition loop into its own function: in
> normal operation we iterate through all services in $ss, but for a
> deferred disarm we only iterate through the HA resources in transient
> states, which should be resolved.

I pushed a follow-up [0] that should deal with this, another look at
that would be appreciated!

FWIW, I mostly pushed directly as I still wanted to do a bump + test
upload today, because if everything is good we get a small regression
fix out to users faster; if not, we can follow up here and no harm done.
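To make the intended semantics concrete, here is a rough sketch of the two
behaviours discussed above: the current "defer if anything is transient"
check and your idea of only letting already-transient resources progress
during a deferred disarm. This is purely illustrative Python, not the
actual Perl code; all names here are made up for the sketch:

```python
# The four transient FSM states handle_disarm() defers on.
TRANSIENT_STATES = {"fence", "recovery", "migrate", "relocate"}

def disarm_can_proceed(service_status):
    # Mirrors the defer check: disarm only goes ahead once no
    # resource sits in a transient FSM state.
    return not any(
        s["state"] in TRANSIENT_STATES for s in service_status.values()
    )

def services_to_process(service_status, disarm_deferred):
    # During a deferred disarm, only iterate the already-transient
    # resources, so stable ones (like vm:101 below, with a pending
    # migrate command) are not pushed into new transient states.
    if not disarm_deferred:
        return sorted(service_status)
    return sorted(
        sid
        for sid, s in service_status.items()
        if s["state"] in TRANSIENT_STATES
    )

ss = {
    "vm:101": {"state": "started", "cmd": ["migrate", "node2"]},
    "vm:102": {"state": "recovery"},
}

print(disarm_can_proceed(ss))         # False -> disarm deferred
print(services_to_process(ss, True))  # ['vm:102'] only, vm:101 untouched

ss["vm:102"]["state"] = "started"
print(disarm_can_proceed(ss))         # True -> disarm starts now
```

With the filtered iteration, vm:101's queued migrate command simply stays
queued during the deferral and is then dropped once the disarm proceeds,
which matches "disarm takes priority" in both cases.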
[0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21

[...]

>> +    if (my $disarm = $ms->{disarm}) {
>> +        if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
>> +            return; # disarm active and progressing, skip normal service state machine
>> +        }
>> +        # disarm deferred (e.g. due to active fencing) - fall through to let it complete
>> +    }
>
> Regarding the load balancing patches, I would change this to
>
> -    }
> +    } else {
> +        # load balance only if disarm is disabled as during a deferred disarm
> +        # the HA Manager should not introduce any new migrations
> +        $self->load_balance();
> +    }
>
> there.

Yeah, that's also what I did for the merge conflict when applying your
series for review.

>> +
>> +    $self->{all_lrms_disarmed} = 0;
>> +
>>      for (;;) {
>>          my $repeat = 0;