From: Marco Gaiarin
To: pve-user@lists.proxmox.com
Date: Tue, 19 Aug 2025 17:08:54 +0200
Subject: [PVE-User] Quorum bouncing, VM does not start...

We have some pairs of servers in several local branches of our organization, clustered but deliberately not set up for (automatic) failover; this is intended.

Most of these branch offices close for the summer holidays, which is also when power outages flourish. ;-) Rather frequently the whole site gets powered off: the UPSes do their job but sooner or later shut down the servers (and all the other equipment), until some local employee goes to the site and powers everything back up.

The servers are backed by two UPSes (one per server); the UPSes also power a stack of two Catalyst 2960S switches (again, one UPS per switch). Every server interface is part of a trunk/bond, with one cable going to switch1 and one to switch2 of the stack.
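To give an idea, a bond of this kind typically looks like this in /etc/network/interfaces (a minimal sketch, not our exact configuration: the NIC names, the bond mode and the addresses are placeholders):

  auto bond0
  iface bond0 inet manual
          bond-slaves eno1 eno2
          # eno1 goes to switch1, eno2 to switch2 of the stack
          bond-miimon 100
          bond-mode 802.3ad
          bond-xmit-hash-policy layer2+3

  auto vmbr0
  iface vmbr0 inet static
          address 192.0.2.11/24
          gateway 192.0.2.1
          bridge-ports bond0
          bridge-stp off
          bridge-fd 0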
We have recently upgraded to PVE 8 and found that, if the whole site gets powered off, sometimes (but with a decent frequency) only some of the VMs get powered back on. Digging for the culprit we found:

2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.

So the server restarts, gets quorum and starts the VMs in order; but then quorum is suddenly lost:

2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.

and the subsequent VMs are not started; after some seconds quorum comes back and everything returns to normal, but the remaining VMs have to be started by hand.

Clearly, if we reboot or power off the two servers while the switches stay powered on, everything works as expected. We managed to power on the servers and reboot the switch stack at the same time, and the trouble got triggered. So it seems that quorum is probably lost because the switches stop forwarding traffic for a while as they do their thing (e.g. binding the second unit into the stack and bringing up the Ethernet bonds); that interrupts cluster communication long enough to lose quorum, and bang.

We have tried to set:

pvenode config set --startall-onboot-delay 120

on the two nodes, repeated the experiment (i.e. power on the servers and reboot the switch stack at the same time), and the trouble does not trigger anymore (the exact steps are sketched at the end of this message).

Still, I'm asking for some feedback... particularly:

1) We were on PVE 6: has something changed in how quorum is handled between PVE 6 and PVE 8? Because before upgrading we never hit this...

2) Is there a better solution to this?

Thanks.
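For completeness, this is roughly how we applied and checked the delay (a sketch of what we did; 120 seconds is just the first value we tried, and the setting is per node, so it has to be repeated on each of the two nodes):

  # on each of the two nodes, as root
  pvenode config set --startall-onboot-delay 120

  # read the node configuration back to confirm the setting
  pvenode config get

  # after a cold start, check membership/quorum while the switch stack reconverges
  pvecm status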