From: Marco Gaiarin
To: pve-user@lists.proxmox.com
Date: Tue, 19 Aug 2025 17:08:54 +0200
Subject: [PVE-User] Quorum bouncing, VM does not start...

We have some pairs of servers in several local branches of our organization, clustered but deliberately not set up for (automatic) failover; this is intended.

Most of these branch offices close for the summer holidays, which is also when power outages flourish. ;-) Rather frequently the whole site gets powered off: the UPSes do their job but sooner or later shut down the servers (and all the other equipment), until some local employee goes to the site and powers everything back up.

The servers are backed by two UPSes (one per server); the UPSes also power a stack of two Catalyst 2960S switches (again, one UPS per switch). Every server interface is part of a trunk/bond, with one cable going to switch1 and one to switch2 of the stack.
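To give an idea, a bond of this kind typically looks like this in /etc/network/interfaces (a minimal sketch, not our exact configuration: the NIC names, the bond mode and the addresses are placeholders):

  auto bond0
  iface bond0 inet manual
          bond-slaves eno1 eno2
          # eno1 goes to switch1, eno2 to switch2 of the stack
          bond-miimon 100
          bond-mode 802.3ad
          bond-xmit-hash-policy layer2+3

  auto vmbr0
  iface vmbr0 inet static
          address 192.0.2.11/24
          gateway 192.0.2.1
          bridge-ports bond0
          bridge-stp off
          bridge-fd 0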
We have recently upgraded to PVE 8 and found that, if the whole site gets powered off, sometimes (but with a decent frequency) only some of the VMs get powered back on. Digging for the culprit we found:

2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.

So the server restarts, gets quorum and starts the VMs in order; but then quorum is suddenly lost:

2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.

and the subsequent VMs are not started; after some seconds quorum comes back and everything returns to normal, but the remaining VMs have to be started by hand.

Clearly, if we reboot or power off the two servers while the switches stay powered on, everything works as expected. We managed to power on the servers and reboot the switch stack at the same time, and the trouble got triggered. So it seems that quorum is probably lost because the switches stop forwarding traffic for a while as they do their thing (e.g. binding the second unit into the stack and bringing up the Ethernet bonds); that interrupts cluster communication long enough to lose quorum, and bang.

We have tried to set:

pvenode config set --startall-onboot-delay 120

on the two nodes, repeated the experiment (i.e. power on the servers and reboot the switch stack at the same time), and the trouble does not trigger anymore (the exact steps are sketched at the end of this message).

Still, I'm asking for some feedback... particularly:

1) We were on PVE 6: has something changed in how quorum is handled between PVE 6 and PVE 8? Because before upgrading we never hit this...

2) Is there a better solution to this?

Thanks.
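For completeness, this is roughly how we applied and checked the delay (a sketch of what we did; 120 seconds is just the first value we tried, and the setting is per node, so it has to be repeated on each of the two nodes):

  # on each of the two nodes, as root
  pvenode config set --startall-onboot-delay 120

  # read the node configuration back to confirm the setting
  pvenode config get

  # after a cold start, check membership/quorum while the switch stack reconverges
  pvecm status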