public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com>,
	<pve-devel@lists.proxmox.com>
Subject: Re: [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Tue, 17 Mar 2026 13:36:08 +0100	[thread overview]
Message-ID: <DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com> (raw)
In-Reply-To: <20260309220128.973793-3-t.lamprecht@proxmox.com>

Please see my comments inline.

On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote:

[snip]

> More test ideas welcome, but testing on real clusters would be actually
> even better ;-)

In test src/test/test-disarm-maintenance1/ with cmdlist

    [
        [ "power node1 on", "power node2 on", "power node3 on"],
        [ "crm node3 enable-node-maintenance" ],
        [ "crm node1 disarm-ha freeze" ],
        [ "crm node1 arm-ha" ],
        [ "crm node3 disable-node-maintenance" ]
    ]

the services that have been migrated away from node3 (due to the enabled
node-maintenance) are moved back to node3 upon re-arming HA and
disabling node-maintenance, which is fine.

However, if HA was disarmed with resource mode 'ignore' instead of
'freeze', the services are not moved back upon re-arming HA and
disabling node-maintenance. 

It might be a good idea to:

  - prohibit disarming HA with 'ignore' if at least one node is in
    maintenance mode (see [0]); and
  - add a test correponding to src/test/test-disarm-maintenance1/ but
    with cmdlist:

    [
        [ "power node1 on", "power node2 on", "power node3 on"],
        [ "crm node3 enable-node-maintenance" ],
        [ "crm node1 disarm-ha ignore" ],
        [ "crm node3 disable-node-maintenance" ],
        [ "crm node1 disarm-ha ignore" ],
        [ "crm node1 arm-ha" ]
    ]

where the expectation is that: 

  - the command to disarm HA with 'ignore' will be ignored; 
  - the CRM logs on warn-level that maintenance on node3 ought to be
    disabled first; and
  - after disabling node-maintenance on node3, the command to disarm HA
    with 'ignore' will no longer be ignored.

>

[snip]

> +++ b/src/PVE/HA/Manager.pm
> @@ -75,6 +75,9 @@ sub new {
>      # on change of active master.
>      $self->{ms}->{node_request} = $old_ms->{node_request} if defined($old_ms->{node_request});
>  
> +    # preserve disarm state across CRM master changes
> +    $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
> +
>      $self->update_crs_scheduler_mode(); # initial set, we update it once every loop
>  
>      return $self;
> @@ -472,7 +475,12 @@ sub update_crm_commands {
>              my $node = $1;
>  
>              my $state = $ns->get_node_state($node);
> -            if ($state eq 'online') {
> +            if ($ms->{disarm}) {
> +                $haenv->log(
> +                    'warn',
> +                    "ignoring maintenance command for node $node - HA stack is disarmed",
> +                );
> +            } elsif ($state eq 'online') {
>                  $ms->{node_request}->{$node}->{maintenance} = 1;
>              } elsif ($state eq 'maintenance') {
>                  $haenv->log(
> @@ -493,6 +501,25 @@ sub update_crm_commands {
>                  );
>              }
>              delete $ms->{node_request}->{$node}->{maintenance}; # gets flushed out at the end of the CRM loop
> +        } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
> +            my $mode = $1;
> +
> +            if ($ms->{disarm}) {
> +                $haenv->log(
> +                    'warn',
> +                    "ignoring disarm-ha command - already in disarm state ($ms->{disarm}->{state})",
> +                );
> +            } else {
> +                $haenv->log('info', "got crm command: disarm-ha $mode");

[0] e.g., here adding something along the lines of:

                   if ($mode eq 'ignore') {
                       for my $node (keys %{ $ms->{node_request} }) {
                           if ($ms->{node_request}->{$node}->{maintenance}) {
                               $haenv->log(
                                   'warn',
                                   "ignoring disarm-ha command - "
                                       . "disable maintenance mode on node '$node' first",
                               );
                               return;
                           }
                       }
                   }

> +                $ms->{disarm} = { mode => $mode, state => 'disarming' };
> +            }
> +        } elsif ($cmd =~ m/^arm-ha$/) {

[snip]

> +
>  sub manage {
>      my ($self) = @_;
>  
> @@ -657,8 +765,12 @@ sub manage {
>  
>      # compute new service status
>  
> +    # skip service add/remove when disarmed - handle_disarm manages service status
> +    my $is_disarmed = $ms->{disarm};
> +
>      # add new service
>      foreach my $sid (sort keys %$sc) {
> +        next if $is_disarmed;
>          next if $ss->{$sid}; # already there
>          my $cd = $sc->{$sid};
>          next if $cd->{state} eq 'ignored';
> @@ -675,6 +787,7 @@ sub manage {
>  
>      # remove stale or ignored services from manager state
>      foreach my $sid (keys %$ss) {
> +        next if $is_disarmed;
>          next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
>  
>          my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';

nit: Instead of checking for each $sid whether $is_disarmed, a single
guard could be used to skip service add/remove when disarmed, e.g.:

       # skip service add/remove when disarmed - handle_disarm manages service status
       if (!$ms->{disarm}) {
          foreach my $sid (sort keys %$sc) {
              next if $ss->{$sid}; # already there
              my $cd = $sc->{$sid};
              next if $cd->{state} eq 'ignored';
              # ...
	  }
      
          # remove stale or ignored services from manager state
          foreach my $sid (keys %$ss) {
              next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
      
              my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
	      # ...
	  }
       }

> @@ -713,6 +826,15 @@ sub manage {

[snip]

> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 8cbf48d..b4f1d88 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,10 @@ sub sim_hardware_cmd {
>                  || $action eq 'disable-node-maintenance'
>              ) {
>                  $self->queue_crm_commands_nolock("$action $node");
> +            } elsif ($action eq 'disarm-ha') {
> +                $self->queue_crm_commands_nolock("disarm-ha $params[0]");

nit: to increase readability and usability

                   my $mode = $params[0];

                   die "sim_hardware_cmd: unknown resource mode '$mode'"
                       if $mode !~ m/^(freeze|ignore)$/;

                   $self->queue_crm_commands_nolock("disarm-ha $mode");

> +            } elsif ($action eq 'arm-ha') {
> +                $self->queue_crm_commands_nolock("arm-ha");
>              } else {
>                  die "sim_hardware_cmd: unknown action '$action'";
>              }

[snip]




  reply	other threads:[~2026-03-17 12:35 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09 21:57 [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 1/3] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-17 12:36   ` Dominik Rusovac [this message]
2026-03-09 21:57 ` [PATCH ha-manager 3/3] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-17 12:47   ` Dominik Rusovac
2026-03-17 12:35 ` [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com \
    --to=d.rusovac@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    --cc=t.lamprecht@proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal