From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id CC3EB1FF137 for ; Tue, 17 Mar 2026 13:35:59 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 08D99CB72; Tue, 17 Mar 2026 13:36:13 +0100 (CET) Content-Type: text/plain; charset=UTF-8 Date: Tue, 17 Mar 2026 13:36:08 +0100 Message-Id: Subject: Re: [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance From: "Dominik Rusovac" To: "Thomas Lamprecht" , Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 X-Mailer: aerc 0.20.0 References: <20260309220128.973793-1-t.lamprecht@proxmox.com> <20260309220128.973793-3-t.lamprecht@proxmox.com> In-Reply-To: <20260309220128.973793-3-t.lamprecht@proxmox.com> X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1773750927211 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.353 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: PTW6LI5QK6OUFI6RD5IBHXFEYW3HSJ5M X-Message-ID-Hash: PTW6LI5QK6OUFI6RD5IBHXFEYW3HSJ5M X-MailFrom: d.rusovac@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Please see my comments inline. On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote: [snip] > More test ideas welcome, but testing on real clusters would be actually > even better ;-) In test src/test/test-disarm-maintenance1/ with cmdlist [ [ "power node1 on", "power node2 on", "power node3 on"], [ "crm node3 enable-node-maintenance" ], [ "crm node1 disarm-ha freeze" ], [ "crm node1 arm-ha" ], [ "crm node3 disable-node-maintenance" ] ] the services that have been migrated away from node3 (due to the enabled node-maintenance) are moved back to node3 upon re-arming HA and disabling node-maintenance, which is fine. However, if HA was disarmed with resource mode 'ignore' instead of 'freeze', the services are not moved back upon re-arming HA and disabling node-maintenance.=20 It might be a good idea to: - prohibit disarming HA with 'ignore' if at least one node is in maintenance mode (see [0]); and - add a test correponding to src/test/test-disarm-maintenance1/ but with cmdlist: [ [ "power node1 on", "power node2 on", "power node3 on"], [ "crm node3 enable-node-maintenance" ], [ "crm node1 disarm-ha ignore" ], [ "crm node3 disable-node-maintenance" ], [ "crm node1 disarm-ha ignore" ], [ "crm node1 arm-ha" ] ] where the expectation is that:=20 - the command to disarm HA with 'ignore' will be ignored;=20 - the CRM logs on warn-level that maintenance on node3 ought to be disabled first; and - after disabling node-maintenance on node3, the command to disarm HA with 'ignore' will no longer be ignored. > [snip] > +++ b/src/PVE/HA/Manager.pm > @@ -75,6 +75,9 @@ sub new { > # on change of active master. > $self->{ms}->{node_request} =3D $old_ms->{node_request} if defined($= old_ms->{node_request}); > =20 > + # preserve disarm state across CRM master changes > + $self->{ms}->{disarm} =3D $old_ms->{disarm} if defined($old_ms->{dis= arm}); > + > $self->update_crs_scheduler_mode(); # initial set, we update it once= every loop > =20 > return $self; > @@ -472,7 +475,12 @@ sub update_crm_commands { > my $node =3D $1; > =20 > my $state =3D $ns->get_node_state($node); > - if ($state eq 'online') { > + if ($ms->{disarm}) { > + $haenv->log( > + 'warn', > + "ignoring maintenance command for node $node - HA st= ack is disarmed", > + ); > + } elsif ($state eq 'online') { > $ms->{node_request}->{$node}->{maintenance} =3D 1; > } elsif ($state eq 'maintenance') { > $haenv->log( > @@ -493,6 +501,25 @@ sub update_crm_commands { > ); > } > delete $ms->{node_request}->{$node}->{maintenance}; # gets f= lushed out at the end of the CRM loop > + } elsif ($cmd =3D~ m/^disarm-ha\s+(freeze|ignore)$/) { > + my $mode =3D $1; > + > + if ($ms->{disarm}) { > + $haenv->log( > + 'warn', > + "ignoring disarm-ha command - already in disarm stat= e ($ms->{disarm}->{state})", > + ); > + } else { > + $haenv->log('info', "got crm command: disarm-ha $mode"); [0] e.g., here adding something along the lines of: if ($mode eq 'ignore') { for my $node (keys %{ $ms->{node_request} }) { if ($ms->{node_request}->{$node}->{maintenance})= { $haenv->log( 'warn', "ignoring disarm-ha command - " . "disable maintenance mode on node = '$node' first", ); return; } } } > + $ms->{disarm} =3D { mode =3D> $mode, state =3D> 'disarmi= ng' }; > + } > + } elsif ($cmd =3D~ m/^arm-ha$/) { [snip] > + > sub manage { > my ($self) =3D @_; > =20 > @@ -657,8 +765,12 @@ sub manage { > =20 > # compute new service status > =20 > + # skip service add/remove when disarmed - handle_disarm manages serv= ice status > + my $is_disarmed =3D $ms->{disarm}; > + > # add new service > foreach my $sid (sort keys %$sc) { > + next if $is_disarmed; > next if $ss->{$sid}; # already there > my $cd =3D $sc->{$sid}; > next if $cd->{state} eq 'ignored'; > @@ -675,6 +787,7 @@ sub manage { > =20 > # remove stale or ignored services from manager state > foreach my $sid (keys %$ss) { > + next if $is_disarmed; > next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored'; > =20 > my $reason =3D defined($sc->{$sid}) ? 'ignored state requested' = : 'no config'; nit: Instead of checking for each $sid whether $is_disarmed, a single guard could be used to skip service add/remove when disarmed, e.g.: # skip service add/remove when disarmed - handle_disarm manages serv= ice status if (!$ms->{disarm}) { foreach my $sid (sort keys %$sc) { next if $ss->{$sid}; # already there my $cd =3D $sc->{$sid}; next if $cd->{state} eq 'ignored'; # ... } =20 # remove stale or ignored services from manager state foreach my $sid (keys %$ss) { next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored'; =20 my $reason =3D defined($sc->{$sid}) ? 'ignored state requeste= d' : 'no config'; # ... } } > @@ -713,6 +826,15 @@ sub manage { [snip] > diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm > index 8cbf48d..b4f1d88 100644 > --- a/src/PVE/HA/Sim/Hardware.pm > +++ b/src/PVE/HA/Sim/Hardware.pm > @@ -835,6 +835,10 @@ sub sim_hardware_cmd { > || $action eq 'disable-node-maintenance' > ) { > $self->queue_crm_commands_nolock("$action $node"); > + } elsif ($action eq 'disarm-ha') { > + $self->queue_crm_commands_nolock("disarm-ha $params[0]")= ; nit: to increase readability and usability my $mode =3D $params[0]; die "sim_hardware_cmd: unknown resource mode '$mode'" if $mode !~ m/^(freeze|ignore)$/; $self->queue_crm_commands_nolock("disarm-ha $mode"); > + } elsif ($action eq 'arm-ha') { > + $self->queue_crm_commands_nolock("arm-ha"); > } else { > die "sim_hardware_cmd: unknown action '$action'"; > } [snip]