From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com>,
<pve-devel@lists.proxmox.com>
Subject: Re: [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance
Date: Tue, 17 Mar 2026 13:36:08 +0100 [thread overview]
Message-ID: <DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com> (raw)
In-Reply-To: <20260309220128.973793-3-t.lamprecht@proxmox.com>
Please see my comments inline.
On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote:
[snip]
> More test ideas welcome, but testing on real clusters would be actually
> even better ;-)
In test src/test/test-disarm-maintenance1/ with cmdlist
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha freeze" ],
[ "crm node1 arm-ha" ],
[ "crm node3 disable-node-maintenance" ]
]
the services that have been migrated away from node3 (due to the enabled
node-maintenance) are moved back to node3 upon re-arming HA and
disabling node-maintenance, which is fine.
However, if HA was disarmed with resource mode 'ignore' instead of
'freeze', the services are not moved back upon re-arming HA and
disabling node-maintenance.
It might be a good idea to:
- prohibit disarming HA with 'ignore' if at least one node is in
maintenance mode (see [0]); and
- add a test correponding to src/test/test-disarm-maintenance1/ but
with cmdlist:
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node3 disable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node1 arm-ha" ]
]
where the expectation is that:
- the command to disarm HA with 'ignore' will be ignored;
- the CRM logs on warn-level that maintenance on node3 ought to be
disabled first; and
- after disabling node-maintenance on node3, the command to disarm HA
with 'ignore' will no longer be ignored.
>
[snip]
> +++ b/src/PVE/HA/Manager.pm
> @@ -75,6 +75,9 @@ sub new {
> # on change of active master.
> $self->{ms}->{node_request} = $old_ms->{node_request} if defined($old_ms->{node_request});
>
> + # preserve disarm state across CRM master changes
> + $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
> +
> $self->update_crs_scheduler_mode(); # initial set, we update it once every loop
>
> return $self;
> @@ -472,7 +475,12 @@ sub update_crm_commands {
> my $node = $1;
>
> my $state = $ns->get_node_state($node);
> - if ($state eq 'online') {
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring maintenance command for node $node - HA stack is disarmed",
> + );
> + } elsif ($state eq 'online') {
> $ms->{node_request}->{$node}->{maintenance} = 1;
> } elsif ($state eq 'maintenance') {
> $haenv->log(
> @@ -493,6 +501,25 @@ sub update_crm_commands {
> );
> }
> delete $ms->{node_request}->{$node}->{maintenance}; # gets flushed out at the end of the CRM loop
> + } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
> + my $mode = $1;
> +
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring disarm-ha command - already in disarm state ($ms->{disarm}->{state})",
> + );
> + } else {
> + $haenv->log('info', "got crm command: disarm-ha $mode");
[0] e.g., here adding something along the lines of:
if ($mode eq 'ignore') {
for my $node (keys %{ $ms->{node_request} }) {
if ($ms->{node_request}->{$node}->{maintenance}) {
$haenv->log(
'warn',
"ignoring disarm-ha command - "
. "disable maintenance mode on node '$node' first",
);
return;
}
}
}
> + $ms->{disarm} = { mode => $mode, state => 'disarming' };
> + }
> + } elsif ($cmd =~ m/^arm-ha$/) {
[snip]
> +
> sub manage {
> my ($self) = @_;
>
> @@ -657,8 +765,12 @@ sub manage {
>
> # compute new service status
>
> + # skip service add/remove when disarmed - handle_disarm manages service status
> + my $is_disarmed = $ms->{disarm};
> +
> # add new service
> foreach my $sid (sort keys %$sc) {
> + next if $is_disarmed;
> next if $ss->{$sid}; # already there
> my $cd = $sc->{$sid};
> next if $cd->{state} eq 'ignored';
> @@ -675,6 +787,7 @@ sub manage {
>
> # remove stale or ignored services from manager state
> foreach my $sid (keys %$ss) {
> + next if $is_disarmed;
> next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
>
> my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
nit: Instead of checking for each $sid whether $is_disarmed, a single
guard could be used to skip service add/remove when disarmed, e.g.:
# skip service add/remove when disarmed - handle_disarm manages service status
if (!$ms->{disarm}) {
foreach my $sid (sort keys %$sc) {
next if $ss->{$sid}; # already there
my $cd = $sc->{$sid};
next if $cd->{state} eq 'ignored';
# ...
}
# remove stale or ignored services from manager state
foreach my $sid (keys %$ss) {
next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no config';
# ...
}
}
> @@ -713,6 +826,15 @@ sub manage {
[snip]
> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 8cbf48d..b4f1d88 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,10 @@ sub sim_hardware_cmd {
> || $action eq 'disable-node-maintenance'
> ) {
> $self->queue_crm_commands_nolock("$action $node");
> + } elsif ($action eq 'disarm-ha') {
> + $self->queue_crm_commands_nolock("disarm-ha $params[0]");
nit: to increase readability and usability
my $mode = $params[0];
die "sim_hardware_cmd: unknown resource mode '$mode'"
if $mode !~ m/^(freeze|ignore)$/;
$self->queue_crm_commands_nolock("disarm-ha $mode");
> + } elsif ($action eq 'arm-ha') {
> + $self->queue_crm_commands_nolock("arm-ha");
> } else {
> die "sim_hardware_cmd: unknown action '$action'";
> }
[snip]
next prev parent reply other threads:[~2026-03-17 12:35 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-09 21:57 [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer " Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 1/3] api: status: add fencing status entry with armed/standby state Thomas Lamprecht
2026-03-09 21:57 ` [PATCH ha-manager 2/3] fix #2751: implement disarm-ha and arm-ha for safe cluster maintenance Thomas Lamprecht
2026-03-17 12:36 ` Dominik Rusovac [this message]
2026-03-09 21:57 ` [PATCH ha-manager 3/3] api: status: add disarm-ha and arm-ha endpoints and CLI wiring Thomas Lamprecht
2026-03-17 12:47 ` Dominik Rusovac
2026-03-17 12:35 ` [PATCH ha-manager 0/3] fix #2751: implement disarm/arm HA for safer cluster maintenance Dominik Rusovac
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DH528PN7O7LQ.3518UFY2X4TJR@proxmox.com \
--to=d.rusovac@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
--cc=t.lamprecht@proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox