From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com>,
<pve-devel@lists.proxmox.com>
Subject: Re: [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Tue, 17 Mar 2026 13:36:08 +0100 [thread overview]
Message-ID: <DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com> (raw)
In-Reply-To: <20260309220128.973793-3-t.lamprecht@proxmox.com>
Please see my comments inline.
On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote:
[snip]
> More test ideas welcome, but testing on real clusters would be actually
> even better ;-)
In test src/test/test-disarm-maintenance1/ with cmdlist
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha freeze" ],
[ "crm node1 arm-ha" ],
[ "crm node3 disable-node-maintenance" ]
]
the services that have been migrated away from node3 (due to the enabled
node-maintenance) are moved back to node3 upon re-arming HA and
disabling node-maintenance, which is fine.
However, if HA was disarmed with resource mode 'ignore' instead of
'freeze', the services are not moved back upon re-arming HA and
disabling node-maintenance.
It might be a good idea to:
- prohibit disarming HA with 'ignore' if at least one node is in
maintenance mode (see [0]); and
- add a test correponding to src/test/test-disarm-maintenance1/ but
with cmdlist:
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node3 disable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node1 arm-ha" ]
]
where the expectation is that:
- the command to disarm HA with 'ignore' will be ignored;
- the CRM logs on warn-level that maintenance on node3 ought to be
disabled first; and
- after disabling node-maintenance on node3, the command to disarm HA
with 'ignore' will no longer be ignored.
>
[snip]
> +++ b/src/PVE/HA/Manager.pm
> @@ -75,6 +75,9 @@ sub new {
> # on change of active master.
> $self->{ms}->{node_request} = $old_ms->{node_request} if defined($old_ms->{node_request});
>
> + # preserve disarm state across CRM master changes
> + $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
> +
> $self->update_crs_scheduler_mode(); # initial set, we update it once every loop
>
> return $self;
> @@ -472,7 +475,12 @@ sub update_crm_commands {
> my $node = $1;
>
> my $state = $ns->get_node_state($node);
> - if ($state eq 'online') {
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring maintenance command for node $node - HA stack is disarmed",
> + );
> + } elsif ($state eq 'online') {
> $ms->{node_request}->{$node}->{maintenance} = 1;
> } elsif ($state eq 'maintenance') {
> $haenv->log(
> @@ -493,6 +501,25 @@ sub update_crm_commands {
> );
> }
> delete $ms->{node_request}->{$node}->{maintenance}; # gets flushed out at the end of the CRM loop
> + } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
> + my $mode = $1;
> +
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring disarm-ha command - already in disarm state ($ms->{disarm}->{state})",
> + );
> + } else {
> + $haenv->log('info', "got crm command: disarm-ha $mode");
[0] e.g., here adding something along the lines of:
if ($mode eq 'ignore') {
for my $node (keys %{ $ms->{node_request} }) {
if ($ms->{node_request}->{$node}->{maintenance}) {
$haenv->log(
'warn',
"ignoring disarm-ha command - "
. "disable maintenance mode on node '$node' first",
);
return;
}
}
}
> + $ms->{disarm} = { mode => $mode, state => 'disarming' };
> + }
> + } elsif ($cmd =~ m/^arm-ha$/) {
[snip]
> +
> sub manage {
> my ($self) = @_;
>
> @@ -657,8 +765,12 @@ sub manage {
>
> # compute new service status
>
> + # skip service add/remove when disarmed - handle_disarm manages service status
> + my $is_disarmed = $ms->{disarm};
> +
> # add new service
> foreach my $sid (sort keys %$sc) {
> + next if $is_disarmed;
> next if $ss->{$sid}; # already there
> my $cd = $sc->{$sid};
> next if $cd->{state} eq 'ignored';
> @@ -675,6 +787,7 @@ sub manage {
>
> # remove stale or ignored services from manager state
> foreach my $sid (keys %$ss) {
> + next if $is_disarmed;
> next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
>
> my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
nit: Instead of checking for each $sid whether $is_disarmed, a single
guard could be used to skip service add/remove when disarmed, e.g.:
# skip service add/remove when disarmed - handle_disarm manages service status
if (!$ms->{disarm}) {
foreach my $sid (sort keys %$sc) {
next if $ss->{$sid}; # already there
my $cd = $sc->{$sid};
next if $cd->{state} eq 'ignored';
# ...
}
# remove stale or ignored services from manager state
foreach my $sid (keys %$ss) {
next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
# ...
}
}
> @@ -713,6 +826,15 @@ sub manage {
[snip]
> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 8cbf48d..b4f1d88 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,10 @@ sub sim_hardware_cmd {
> || $action eq 'disable-node-maintenance'
> ) {
> $self->queue_crm_commands_nolock("$action $node");
> + } elsif ($action eq 'disarm-ha') {
> + $self->queue_crm_commands_nolock("disarm-ha $params[0]");
nit: to increase readability and usability
my $mode = $params[0];
die "sim_hardware_cmd: unknown resource mode '$mode'"
if $mode !~ m/^(freeze|ignore)$/;
$self->queue_crm_commands_nolock("disarm-ha $mode");
> + } elsif ($action eq 'arm-ha') {
> + $self->queue_crm_commands_nolock("arm-ha");
> } else {
> die "sim_hardware_cmd: unknown action '$action'";
> }
[snip]
next prev parent reply other threads:[~2026-03-17 12:35 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-09 21:57 [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 1/3] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-17 12:36 ` Dominik Rusovac [this message]
2026-03-09 21:57 ` [PATCH ha-manager 3/3] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-17 12:47 ` Dominik Rusovac
2026-03-17 12:35 ` [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com \
--to=d.rusovac@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
--cc=t.lamprecht@proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.