From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Daniel Kral <d.kral@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Fri, 27 Mar 2026 00:15:43 +0100 [thread overview]
Message-ID: <deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com> (raw)
In-Reply-To: <DHCU9R824P2D.38DIPJF7JXIZ9@proxmox.com>
Am 26.03.26 um 17:02 schrieb Daniel Kral:
> Nice work!
>
> While rebasing the load balancing patches on this, I had a few thoughts
> about the transient states, which I wanted to add below. Nothing really
> urgent and I might have missed something so correct me if I'm wrong.
>
>
> One small UX nit:
>
> As the test case 'test-disarm-relocate1' documents, the 'disarm-ha'
> command takes priority over the migrate/relocate command. It is more of
> an edge case anyway that both a migration an disarm-ha command are
> issued in the same HA Manager cycle, but of course could still happen.
>
> Handling the disarming as an admin task is more important than a
> user-initiated migration request. It would still be great that this
> action is relayed back to the user for the migration request (which
> might only see the hamigrate task but no following qmigrate task), but
> this is more an UX thing and should be handled e.g. as part of #6220 [0]
> so users can directly follow the HA request and the further actions.
yeah, UX could be slightly better, but as you hint, it's really a
pre-existing issue for a few cases already, so slightly orthogonal
to this.
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=6220
>
> On Sun Mar 22, 2026 at 12:42 AM CET, Thomas Lamprecht wrote:
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index b1dbe6a..aa29858 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -631,6 +684,94 @@ sub try_persistent_group_migration {
>> }
>> }
>>
>> +sub handle_disarm {
>> + my ($self, $disarm, $ss, $lrm_modes) = @_;
>> +
>> + my $haenv = $self->{haenv};
>> + my $ns = $self->{ns};
>> +
>> + # defer disarm if any services are in a transient state that needs the state machine to resolve
>> + for my $sid (sort keys %$ss) {
>> + my $state = $ss->{$sid}->{state};
>> + if ($state eq 'fence' || $state eq 'recovery') {
>> + $haenv->log(
>> + 'warn', "deferring disarm - service '$sid' is in '$state' state",
>> + );
>> + return 0; # let manage() continue so fence/recovery can progress
>> + }
>> + if ($state eq 'migrate' || $state eq 'relocate') {
>> + $haenv->log(
>> + 'info', "deferring disarm - service '$sid' is in '$state' state",
>> + );
>> + return 0; # let manage() continue so migration can complete
>> + }
>> + }
>
> Here the HA disarming is deferred so that the HA Manager can continue
> with processing the HA resources FSM if at least one of the HA resources
> is in one of the 4 transient states.
>
> This can have side-effects for other HA resources as well, which
> currently aren't in one of these 4 transient states, but are implicitly
> in a transient state.
>
>
> For example:
>
> There are two HA resources in the current service state.
>
> {
> ...
> "service_status": {
> "vm:101": {
> "state": "started",
> "node": "node1",
> "cmd": [ "migrate", "node2" ],
> "uid": "...",
> },
> "vm:102": {
> "state": "...",
> "node": "node1",
> "uid": "...",
> },
> }
> }
>
>
> If vm:102 is in state 'started', then the HA Manager will start to
> disarm the HA stack right away and drop vm:101's migration request in
> both disarm modes.
>
> If vm:102 is in any of the 4 transient states from above though, then
> the HA Manager will defer the disarm and process all HA resources, which
> will cause 'vm:101' to be put in 'migrate' now.
>
>
> So the transient state of one HA resource might cause other HA resources
> to be put in a transient state as well, even though I would have
> expected the deferring here to only resolve the existing transient
> states.
>
> This also means that the HA Manager will handle user requests for HA
> resources in the same HA Manager cycle as a disarm request if there is
> at least one HA resource in one of the 4 transient states. This
> contradicts that disarming takes priority over the migrate/relocate
> request.
good find!
> Two other possible cases of 'implicitly transient' states might be:
>
> - adding a HA rule, which makes a HA resource in 'started' state be put
> in 'migrate' to another node when processing the select_service_node()
> in next_state_started().
>
> - the node of a HA resource is offline delayed in the same round as the
> disarm request. If none of the HA resources are in a transient state
> yet, the disarm request goes through, otherwise the affected HA
> resources might be put in 'fence'.
>
>
> I haven't thought this through fully, but an option might be that we
> only allow the FSM processing of the HA resources, which are in one of
> these 4 transient states and don't process the others.
>
> E.g. breaking out the FSM transition loop into its own function and in
> normal operation we iterate through all services in $ss, but for
> deferred disarming we only iterate through the HA resources in transient
> states, which should be resolved.
I pushed a follow-up [0] that should deal with this, another look at
that would be appreciated!
FWIW, I mostly pushed directly as I still wanted to do a bump+test upload
today, because if everything is good we get a small regression fix faster
out to users, if not we can follow-up here and no harm done.
[0]: https://git.proxmox.com/?p=pve-ha-manager.git;a=commitdiff;h=b6b025a268032ff5302bede1f5eb56247af13f21
[...]
>> + if (my $disarm = $ms->{disarm}) {
>> + if ($self->handle_disarm($disarm, $ss, $lrm_modes)) {
>> + return; # disarm active and progressing, skip normal service state machine
>> + }
>> + # disarm deferred (e.g. due to active fencing) - fall through to let it complete
>> + }
>
> Regarding the load balancing patches, I would change this to
>
> - }
> + } else {
> + # load balance only if disarm is disabled as during a deferred disarm
> + # the HA Manager should not introduce any new migrations
> + $self->load_balance();
> + }
>
> there.
Yeah, that's also what I did for the merge conflict when applying it your series
for review.
>> +
>> + $self->{all_lrms_disarmed} = 0;
>> +
>> for (;;) {
>> my $repeat = 0;
>
>
>
>
next prev parent reply other threads:[~2026-03-26 23:15 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-21 23:42 [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 1/4] sim: hardware: add manual-migrate command for ignored services Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 2/4] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-21 23:42 ` [PATCH ha-manager v2 3/4] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-23 13:04 ` Dominik Rusovac
2026-03-25 15:50 ` Fiona Ebner
2026-03-27 1:17 ` Thomas Lamprecht
2026-03-26 16:02 ` Daniel Kral
2026-03-26 23:15 ` Thomas Lamprecht [this message]
2026-03-27 10:21 ` Daniel Kral
2026-03-21 23:42 ` [PATCH ha-manager v2 4/4] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-23 13:05 ` [PATCH ha-manager v2 0/4] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
2026-03-25 12:06 ` applied: " Thomas Lamprecht
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=deb611b2-fd77-4410-8a35-4983ec965879@proxmox.com \
--to=t.lamprecht@proxmox.com \
--cc=d.kral@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox