Date: Wed, 20 May 2026 09:48:11 +0200
From: "Daniel Kral"
To: "Daniel Kral", "Fiona Ebner", "Thomas Lamprecht"
Subject: Re: [PATCH ha-manager 2/2] make idle LRMs resolve leftover moving HA resources while disarmed
References: <20260519143842.382324-1-d.kral@proxmox.com> <20260519143842.382324-3-d.kral@proxmox.com> <5978c036-c864-413c-a4a7-6febe1b7f2b3@proxmox.com>
List-Id: Proxmox VE development discussion

On Wed May 20, 2026 at 8:53 AM CEST, Daniel Kral wrote:
> On Tue May 19, 2026 at 6:00 PM CEST, Fiona Ebner wrote:
>> On 19.05.26 at 4:39 PM, Daniel Kral wrote:
>>> +
>>> +    for my $sid (keys %$ss) {
>>> +        my ($state, $current_node, $target_node) = $ss->{$sid}->@{qw(state node target)};
>>> +
>>> +        next if $node && (!$current_node || $current_node ne $node);
>>
>> Just wondering: when does !$current_node happen?
>
> AFAIK the only case where this can currently happen is if the HA
> resource's guest doesn't exist in the cluster anymore according to the
> pmxcfs' vmlist and isn't removed by HA Manager anymore (as is done when
> the HA stack is in disarm mode).

Sorry for the noise, had another look: the HA Manager never removes HA
resources that have an undef node (e.g. if the VM was removed in some
way that bypasses the check to also prune the HA resource from the
config), regardless of whether the HA stack is disarming or not:

    # jq '.service_status["vm:2000"]' /etc/pve/ha/manager_status
    {
      "node": "pve",
      "uid": "pHQkcW2HF1jeyQJ5JLb/8Q",
      "state": "stopped"
    }
    # mv /etc/pve/nodes/pve/qemu-server/2000.conf .
    # jq '.service_status["vm:2000"]' /etc/pve/ha/manager_status
    {
      "state": "stopped",
      "uid": "wtRkyVgpB7LcmCtqGBtf+w",
      "node": null
    }

While trying this out a few times, I noticed that this is also one
cause of undef node names being written to the manager_status. As there
was never a timestamp for the undef node entry, fencing was attempted
for the VM, which violated quite a few assumptions in the HA Manager:

May 20 09:24:51 pve-2 pve-ha-crm[22795]: unable to score nodes according to dynamic usage for service 'vm:2000' - did not get dynamic service usage information for 'vm:2000'
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value in numeric comparison (<=>) at /usr/share/perl5/PVE/HA/Manager.pm line 390.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $current_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 396.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $fenced_node in concatenation (.) or string at /usr/share/perl5/PVE/HA/Manager.pm line 1663.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: Use of uninitialized value $fenced_node in string eq at /usr/share/perl5/PVE/HA/Manager.pm line 1664.
May 20 09:24:51 pve-2 pve-ha-crm[22795]: recover service 'vm:2000' from fenced node '' to node 'pve'
May 20 09:24:51 pve-2 pve-ha-crm[22795]: got unexpected error - Configuration file 'nodes/pve-2/qemu-server/2000.conf' does not exist

This isn't something that should happen under normal circumstances,
though. I'll send a patch that adds more defensive checks and/or removes
the service status entry if the HA resource's guest isn't in the vmlist
anymore; for the latter, I'll have to check whether that could cause any
trouble.

Furthermore, if the HA resource then gets fenced, the HA Manager will
acquire the lock for its own node, as get_ha_agent_lock($self, $node)
defaults to the current nodename if $node is undef.

It might also be worth detecting in the HA Manager that HA resources
have changed nodes in between, as it currently doesn't update the node
at all if the resource is moved manually and is already present in the
HA Manager status... I'll look into it further, as I already wanted to
change this behavior slightly for a partial fix of e.g. fenced HA
resources that were migrated in the meantime [0].

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=6610

>
>>
>>> +
>>> +        $deferred_sids->{$sid} = 1 if grep { $state eq $_ } @deferrable_states;
>>> +    }
>>> +
>>> +    return $deferred_sids;
>>> +}
>>> +
>>>  sub get_verbose_service_state {
>>>      my ($service_state, $service_conf) = @_;
>>>