Message-ID: <5978c036-c864-413c-a4a7-6febe1b7f2b3@proxmox.com>
Date: Tue, 19 May 2026 18:00:31 +0200
Subject: Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
To: Daniel Kral, pve-devel@lists.proxmox.com
References: <20260519143842.382324-1-d.kral@proxmox.com> <20260519143842.382324-3-d.kral@proxmox.com>
From: Fiona Ebner
In-Reply-To: <20260519143842.382324-3-d.kral@proxmox.com>
List-Id: Proxmox VE development discussion

On 19.05.26 at 4:39 PM, Daniel Kral wrote:
> If there are HA resources, which are in transient states that defer the
> disarming process, but their LRMs are already in idle state and disarmed
> mode, these LRMs will not properly resolve the transient states of these
> HA resources as assumed by the HA Manager.
>
> For HA resources, which are still moving, this makes the HA Manager
> stuck in a loop, which tries to defer the disarming process to wait for
> a LRM response for these moving HA resources, which will never come as
> the LRM is idle.
>
> Therefore allow the LRM to become active in disarm mode if there are any
> HA resources on the LRM's node, which are in any of these transient
> states, and make sure that the LRM only processes the disarm-deferring
> HA resources while the LRM is active.
>
> Signed-off-by: Daniel Kral
> ---
>  src/PVE/HA/LRM.pm                         | 19 ++++++++++-
>  src/PVE/HA/Manager.pm                     |  8 ++---
>  src/PVE/HA/Tools.pm                       | 17 ++++++++++
>  src/test/test-disarm-idle-lrm1/log.expect | 37 ++++++---------------
>  src/test/test-disarm-idle-lrm2/log.expect | 39 +++++++----------------
>  5 files changed, 58 insertions(+), 62 deletions(-)
>
> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
> index 426982cc..9100d611 100644
> --- a/src/PVE/HA/LRM.pm
> +++ b/src/PVE/HA/LRM.pm
> @@ -312,6 +312,18 @@ sub active_service_count {
>      return PVE::HA::Tools::count_active_services($ss, $nodename);
>  }
>
> +# returns a truthy value if there are HA resources in transient states, which
> +# need to be resolved, e.g. to complete the disarm procedure.
> +sub has_disarm_deferred_services {

Nit: I feel like the variables and functions should be named
disarm_deferring rather than disarm_deferred

> +    my ($self) = @_;
> +
> +    my $ss = $self->{service_status};
> +    my $nodename = $self->{haenv}->nodename();
> +    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename);
> +
> +    return %$deferred_sids;
> +}
> +
>  my $wrote_lrm_status_at_startup = 0;
>
>  sub do_one_iteration {
> @@ -371,7 +383,7 @@ sub work {
>
>          my $service_count = $self->active_service_count();
>
> -        if ($self->{mode} eq 'disarm') {
> +        if ($self->{mode} eq 'disarm' && !$self->has_disarm_deferred_services()) {
>              # stay idle while disarmed, don't acquire lock
>          } elsif (!$fence_request && $service_count && $haenv->quorate()) {
>              if ($self->get_protected_ha_agent_lock()) {
> @@ -709,12 +721,17 @@ sub manage_resources {
>      my $nodename = $haenv->nodename();
>
>      my $ss = $self->{service_status};
> +    my $deferred_sids;

Nit: Here, a full $disarm_deferring_sids would provide the most context.

> +    $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss, $nodename)
> +        if $self->{mode} eq 'disarm';
>
>      foreach my $sid (keys %{ $self->{restart_tries} }) {
>          delete $self->{restart_tries}->{$sid} if !$ss->{$sid};
>      }
>
>      foreach my $sid (keys %$ss) {
> +        next if $deferred_sids && !$deferred_sids->{$sid};
> +
>          my $sd = $ss->{$sid};
>          next if !$sd->{node} || !$sd->{uid};
>          next if $sd->{node} ne $nodename;
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 9b901c4f..a2baf349 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -929,15 +929,13 @@ sub handle_disarm {
>      }
>
>      # defer disarm if any services are in a transient state that needs the state machine to resolve
> -    my $deferred_sids = {};
> -    for my $sid (sort keys %$ss) {
> +    my $deferred_sids = PVE::HA::Tools::get_disarm_deferred_services($ss);
> +    for my $sid (sort keys %$deferred_sids) {
>          my $state = $ss->{$sid}->{state};
>          if ($state eq 'fence' || $state eq 'recovery') {
>              $haenv->log('warning', "deferring disarm - service '$sid' is in '$state' state");
> -            $deferred_sids->{$sid} = 1;
> -        } elsif ($state eq 'migrate' || $state eq 'relocate') {
> +        } else {
>              $haenv->log('info', "deferring disarm - service '$sid' is in '$state' state");
> -            $deferred_sids->{$sid} = 1;
>          }
>      }
>
> diff --git a/src/PVE/HA/Tools.pm b/src/PVE/HA/Tools.pm
> index 26629fb5..37b27e11 100644
> --- a/src/PVE/HA/Tools.pm
> +++ b/src/PVE/HA/Tools.pm
> @@ -213,6 +213,23 @@ sub count_active_services {
>      return $active_count;
>  }
>
> +sub get_disarm_deferred_services {
> +    my ($ss, $node) = @_;
> +
> +    my $deferred_sids = {};
> +    my @deferrable_states = qw(fence recovery migrate relocate);

Nit: disarm_deferring_states

> +
> +    for my $sid (keys %$ss) {
> +        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
> +
> +        next if $node && (!$current_node || $current_node ne $node);

Just wondering: when does !$current_node happen?

> +
> +        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
> +    }
> +
> +    return $deferred_sids;
> +}
> +
>  sub get_verbose_service_state {
>      my ($service_state, $service_conf) = @_;
>
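
For illustration, here is a minimal, untested sketch of how the helper could
look with the naming suggested above. It is a standalone script rather than a
change against Tools.pm, and the sample $ss data is made up. Using a hash
lookup instead of grep and dropping $target_node (which the quoted helper
never reads) are assumptions about what might read better, not requests:

```perl
use strict;
use warnings;

# Hypothetical naming following the nits above; not the code from the patch.
my %disarm_deferring_states = map { $_ => 1 } qw(fence recovery migrate relocate);

# Returns a hash ref { $sid => 1 } of services in a disarm-deferring state.
# If $node is given, only services currently located on that node are returned.
sub get_disarm_deferring_services {
    my ($ss, $node) = @_;

    my $disarm_deferring_sids = {};

    for my $sid (keys %$ss) {
        my ($state, $current_node) = @{ $ss->{$sid} }{qw(state node)};

        # same node filter as in the quoted patch
        next if $node && (!$current_node || $current_node ne $node);

        $disarm_deferring_sids->{$sid} = 1
            if defined($state) && $disarm_deferring_states{$state};
    }

    return $disarm_deferring_sids;
}

# Made-up service status, only to exercise the helper.
my $ss = {
    'vm:101' => { state => 'migrate', node => 'node1', target => 'node2' },
    'vm:102' => { state => 'started', node => 'node1' },
    'vm:103' => { state => 'fence', node => 'node2' },
};

my $on_node1 = get_disarm_deferring_services($ss, 'node1');
print join(',', sort keys %$on_node1), "\n"; # prints "vm:101"
```

Called without a node, as in handle_disarm, the same function would return
both vm:101 and vm:103 for this sample data.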