public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Daniel Kral <d.kral@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Fri, 27 Mar 2026 00:15:43 +0100	[thread overview]
Message-ID: <deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com> (raw)
In-Reply-To: <DHCU9R824P2D.38DIPJF7JXIZ9@proxmox.com>

Am 26.03.26 um 17:02 schrieb Daniel Kral:
> Nice work!
> 
> While rebasing the load balancing patches on this, I had a few thoughts
> about the transient states, which I wanted to add below. Nothing really
> urgent and I might have missed something so correct me if I'm wrong.
> 
> 
> One small UX nit:
> 
> As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
> command takes priority over the migrate/relocate command. It is more of
> an edge case anyway that both a migration an disarm-ha command are
> issued in the same HA Manager cycle, but of course could still happen.
> 
> Handling the disarming as an admin task is more important than a
> user-initiated migration request. It would still be great that this
> action is relayed back to the user for the migration request (which
> might only see the hamigrate task but no following qmigrate task), but
> this is more an UX thing and should be handled e.g. as part of #6220 [0]
> so users can directly follow the HA request and the further actions.

yeah, UX could be slightly better, but as you hint, it's really a
pre-existing issue for a few cases already, so slightly orthogonal
to this.

> 
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220
> 
> On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index b1dbe6a..aa29858 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>>      }
>>  }
>>  
>> +sub handle_disarm {
>> +    my ($self, $disarm, $ss, $lrm_modes) = @_;
>> +
>> +    my $haenv = $self->{haenv};
>> +    my $ns = $self->{ns};
>> +
>> +    # defer disarm if any services are in a transient state that needs the state machine to resolve
>> +    for my $sid (sort keys %$ss) {
>> +        my $state = $ss->{$sid}->{state};
>> +        if ($state eq 'fence' || $state eq 'recovery') {
>> +            $haenv->log(
>> +                'warn', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so fence/recovery can progress
>> +        }
>> +        if ($state eq 'migrate' || $state eq 'relocate') {
>> +            $haenv->log(
>> +                'info', "deferring disarm - service '$sid' is in '$state' state",
>> +            );
>> +            return 0; # let manage() continue so migration can complete
>> +        }
>> +    }
> 
> Here the HA disarming is deferred so that the HA Manager can continue
> with processing the HA resources FSM if at least one of the HA resources
> is in one of the 4 transient states.
> 
> This can have side-effects for other HA resources as well, which
> currently aren't in one of these 4 transient states, but are implicitly
> in a transient state.
> 
> 
> For example:
> 
> There are two HA resources in the current service state.
> 
> {
>    ...
>    "service_status": {
>        "vm:101": {
>            "state": "started",
>            "node": "node1",
>            "cmd": [ "migrate", "node2" ],
>            "uid": "...",
>        },
>        "vm:102": {
>            "state": "...",
>            "node": "node1",
>            "uid": "...",
>        },
>    }
> }
> 
> 
> If vm:102 is in state 'started', then the HA Manager will start to
> disarm the HA stack right away and drop vm:101's migration request in
> both disarm modes.
> 
> If vm:102 is in any of the 4 transient states from above though, then
> the HA Manager will defer the disarm and process all HA resources, which
> will cause 'vm:101' to be put in 'migrate' now.
> 
> 
> So the transient state of one HA resource might cause other HA resources
> to be put in a transient state as well, even though I would have
> expected the deferring here to only resolve the existing transient
> states.
> 
> This also means that the HA Manager will handle user requests for HA
> resources in the same HA Manager cycle as a disarm request if there is
> at least one HA resource in one of the 4 transient states. This
> contradicts that disarming takes priority over the migrate/relocate
> request.

good find!

> Two other possible cases of 'implicitly transient' states might be:
> 
> - adding a HA rule, which makes a HA resource in 'started' state be put
>   in 'migrate' to another node when processing the select_service_node()
>   in next_state_started().
> 
> - the node of a HA resource is offline delayed in the same round as the
>   disarm request. If none of the HA resources are in a transient state
>   yet, the disarm request goes through, otherwise the affected HA
>   resources might be put in 'fence'.
> 
> 
> I haven't thought this through fully, but an option might be that we
> only allow the FSM processing of the HA resources, which are in one of
> these 4 transient states and don't process the others.
> 
> E.g. breaking out the FSM transition loop into its own function and in
> normal operation we iterate through all services in $ss, but for
> deferred disarming we only iterate through the HA resources in transient
> states, which should be resolved.

I pushed a follow-up [0] that should deal with this, another look at
that would be appreciated!
FWIW, I mostly pushed directly as I still wanted to do a bump+test upload
today, because if everything is good we get a small regression fix faster
out to users, if not we can follow-up here and no harm done.

[0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21

[...]

>> +    if (my $disarm = $ms->{disarm}) {
>> +        if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
>> +            return; # disarm active and progressing, skip normal service state machine
>> +        }
>> +        # disarm deferred (e.g. due to active fencing) - fall through to let it complete
>> +    }
> 
> Regarding the load balancing patches, I would change this to
> 
> - }
> + } else {
> +     # load balance only if disarm is disabled as during a deferred disarm
> +     # the HA Manager should not introduce any new migrations
> +     $self->load_balance();
> + }
> 
> there.

Yeah, that's also what I did for the merge conflict when applying it your series
for review.

>> +
>> +    $self->{all_lrms_disarmed} = 0;
>> +
>>      for (;;) {
>>          my $repeat = 0;
> 
> 
> 
> 





  reply	other threads:[~2026-03-26 23:15 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-23 13:04   ` Dominik Rusovac
2026-03-25 15:50   ` Fiona Ebner
2026-03-27  1:17     ` Thomas Lamprecht
2026-03-26 16:02   ` Daniel Kral
2026-03-26 23:15     ` Thomas Lamprecht [this message]
2026-03-27 10:21       ` Daniel Kral
2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
2026-03-25 12:06 ` applied: " Thomas Lamprecht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com \
    --to=t.lamprecht@proxmox.com \
    --cc=d.kral@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal