From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Daniel Kral <d.kral@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Fri, 27 Mar 2026 00:15:43 +0100 [thread overview]
Message-ID: <deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com> (raw)
In-Reply-To: <DHCU9R824P2D.38DIPJF7JXIZ9@proxmox.com>
Am 26.03.26 um 17:02 schrieb Daniel Kral:
> Nice work!
>
> While rebasing the load balancing patches on this, I had a few thoughts
> about the transient states, which I wanted to add below. Nothing really
> urgent and I might have missed something so correct me if I'm wrong.
>
>
> One small UX nit:
>
> As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
> command takes priority over the migrate/relocate command. It is more of
> an edge case anyway that both a migration an disarm-ha command are
> issued in the same HA Manager cycle, but of course could still happen.
>
> Handling the disarming as an admin task is more important than a
> user-initiated migration request. It would still be great that this
> action is relayed back to the user for the migration request (which
> might only see the hamigrate task but no following qmigrate task), but
> this is more an UX thing and should be handled e.g. as part of #6220 [0]
> so users can directly follow the HA request and the further actions.
yeah, UX could be slightly better, but as you hint, it's really a
pre-existing issue for a few cases already, so slightly orthogonal
to this.
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220
>
> On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index b1dbe6a..aa29858 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>> }
>> }
>>
>> +sub handle_disarm {
>> + my ($self, $disarm, $ss, $lrm_modes) = @_;
>> +
>> + my $haenv = $self->{haenv};
>> + my $ns = $self->{ns};
>> +
>> + # defer disarm if any services are in a transient state that needs the state machine to resolve
>> + for my $sid (sort keys %$ss) {
>> + my $state = $ss->{$sid}->{state};
>> + if ($state eq 'fence' || $state eq 'recovery') {
>> + $haenv->log(
>> + 'warn', "deferring disarm - service '$sid' is in '$state' state",
>> + );
>> + return 0; # let manage() continue so fence/recovery can progress
>> + }
>> + if ($state eq 'migrate' || $state eq 'relocate') {
>> + $haenv->log(
>> + 'info', "deferring disarm - service '$sid' is in '$state' state",
>> + );
>> + return 0; # let manage() continue so migration can complete
>> + }
>> + }
>
> Here the HA disarming is deferred so that the HA Manager can continue
> with processing the HA resources FSM if at least one of the HA resources
> is in one of the 4 transient states.
>
> This can have side-effects for other HA resources as well, which
> currently aren't in one of these 4 transient states, but are implicitly
> in a transient state.
>
>
> For example:
>
> There are two HA resources in the current service state.
>
> {
> ...
> "service_status": {
> "vm:101": {
> "state": "started",
> "node": "node1",
> "cmd": [ "migrate", "node2" ],
> "uid": "...",
> },
> "vm:102": {
> "state": "...",
> "node": "node1",
> "uid": "...",
> },
> }
> }
>
>
> If vm:102 is in state 'started', then the HA Manager will start to
> disarm the HA stack right away and drop vm:101's migration request in
> both disarm modes.
>
> If vm:102 is in any of the 4 transient states from above though, then
> the HA Manager will defer the disarm and process all HA resources, which
> will cause 'vm:101' to be put in 'migrate' now.
>
>
> So the transient state of one HA resource might cause other HA resources
> to be put in a transient state as well, even though I would have
> expected the deferring here to only resolve the existing transient
> states.
>
> This also means that the HA Manager will handle user requests for HA
> resources in the same HA Manager cycle as a disarm request if there is
> at least one HA resource in one of the 4 transient states. This
> contradicts that disarming takes priority over the migrate/relocate
> request.
good find!
> Two other possible cases of 'implicitly transient' states might be:
>
> - adding a HA rule, which makes a HA resource in 'started' state be put
> in 'migrate' to another node when processing the select_service_node()
> in next_state_started().
>
> - the node of a HA resource is offline delayed in the same round as the
> disarm request. If none of the HA resources are in a transient state
> yet, the disarm request goes through, otherwise the affected HA
> resources might be put in 'fence'.
>
>
> I haven't thought this through fully, but an option might be that we
> only allow the FSM processing of the HA resources, which are in one of
> these 4 transient states and don't process the others.
>
> E.g. breaking out the FSM transition loop into its own function and in
> normal operation we iterate through all services in $ss, but for
> deferred disarming we only iterate through the HA resources in transient
> states, which should be resolved.
I pushed a follow-up [0] that should deal with this, another look at
that would be appreciated!
FWIW, I mostly pushed directly as I still wanted to do a bump+test upload
today, because if everything is good we get a small regression fix faster
out to users, if not we can follow-up here and no harm done.
[0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21
[...]
>> + if (my $disarm = $ms->{disarm}) {
>> + if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
>> + return; # disarm active and progressing, skip normal service state machine
>> + }
>> + # disarm deferred (e.g. due to active fencing) - fall through to let it complete
>> + }
>
> Regarding the load balancing patches, I would change this to
>
> - }
> + } else {
> + # load balance only if disarm is disabled as during a deferred disarm
> + # the HA Manager should not introduce any new migrations
> + $self->load_balance();
> + }
>
> there.
Yeah, that's also what I did for the merge conflict when applying it your series
for review.
>> +
>> + $self->{all_lrms_disarmed} = 0;
>> +
>> for (;;) {
>> my $repeat = 0;
>
>
>
>
next prev parent reply other threads:[~2026-03-26 23:15 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-23 13:04 ` Dominik Rusovac
2026-03-25 15:50 ` Fiona Ebner
2026-03-27 1:17 ` Thomas Lamprecht
2026-03-26 16:02 ` Daniel Kral
2026-03-26 23:15 ` Thomas Lamprecht [this message]
2026-03-27 10:21 ` Daniel Kral
2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
2026-03-25 12:06 ` applied: " Thomas Lamprecht
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com \
--to=t.lamprecht@proxmox.com \
--cc=d.kral@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.