From: dorsy via pve-user <pve-user@lists.proxmox.com>
To: pve-user@lists.proxmox.com
Cc: dorsy <dorsyka@yahoo.com>
Subject: Re: [PVE-User] Quorum bouncing, VM does not start...
Date: Wed, 20 Aug 2025 00:53:45 +0200
Message-ID: <mailman.179.1755644682.385.pve-user@lists.proxmox.com>
In-Reply-To: <4lgenl-oaf2.ln1@leia.lilliput.linux.it>
I'd suggest a direct link between the hosts for another quorum ring, if
you have a spare network port.
Multiple rings can also be more resilient than MLAG. But that is only
my 2 cents.
see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
and:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds
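For example, a minimal sketch of what the extra link looks like in
/etc/pve/corosync.conf (the second node name, the cluster name and all
addresses here are placeholders, not your config; remember to bump
config_version when editing, as the docs describe):

  totem {
    cluster_name: yourcluster
    config_version: 5
    interface {
      linknumber: 0
    }
    interface {
      linknumber: 1
    }
    ...
  }

  nodelist {
    node {
      name: pdpve1
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 192.168.1.11   # existing path over the switch stack
      ring1_addr: 10.10.10.1     # direct cable between the hosts
    }
    node {
      name: pdpve2
      nodeid: 2
      quorum_votes: 1
      ring0_addr: 192.168.1.12
      ring1_addr: 10.10.10.2
    }
  }

The direct cable keeps a corosync path alive while the switch stack
reboots and re-forms its bonds.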
On 8/19/2025 5:08 PM, Marco Gaiarin wrote:
> We have pairs of servers in some local branches of our organization,
> clustered but deliberately not in failover (or 'automatic failover');
> this is intended.
>
> Most of these branch offices close for the summer holidays, when power
> outages flourish. ;-)
> Rather frequently the whole site gets powered off; the UPSes do their job
> but sooner or later shut down the servers (and all other equipment) until
> some local employee goes to the site and powers everything back up.
>
> The servers are set up with two UPSes (one per server); the UPSes also
> power a stack of two Catalyst 2960S switches (again, one UPS per switch).
> Every server interface is part of a trunk/bond, with one cable on switch1
> and one cable on switch2 of the stack.
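>
> (As an illustration only: the interface names, addresses and bond mode
> below are assumptions, not our exact config. Such a bond in
> /etc/network/interfaces looks roughly like this:)
>
>   auto bond0
>   iface bond0 inet manual
>       bond-slaves eno1 eno2        # one NIC to each switch of the stack
>       bond-miimon 100
>       bond-mode active-backup      # or 802.3ad if the stack runs LACP
>
>   auto vmbr0
>   iface vmbr0 inet static
>       address 192.168.1.11/24
>       gateway 192.168.1.1
>       bridge-ports bond0
>       bridge-stp off
>       bridge-fd 0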
>
>
> We recently upgraded to PVE 8 and found that when the whole site gets
> powered off, sometimes (but with a decent frequency) only some of the VMs
> get powered on again.
>
>
> Digging for the culprit, we found:
>
> 2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
> 2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: <root@pam> starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
> 2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
> 2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
> 2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
> 2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: <root@pam> starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
> 2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
> 2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.
>
> So the servers restart, get quorum, and start VMs in order; but then they
> suddenly lose quorum:
>
> 2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
> 2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
> 2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
> 2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: <root@pam> end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
> 2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.
>
> and the subsequent VMs do not start; after some seconds quorum comes back
> and all goes back to normal, but the VMs have to be started by hand.
>
>
> Clearly, if we reboot or power off the two servers with the switches
> still powered on, everything works as expected.
> When we power on the servers and reboot the switches at the same time,
> the trouble is triggered.
>
>
> So it seems quorum is lost because the switches stop forwarding traffic
> for a while during startup (e.g., binding the second unit into the stack
> and bringing up the ethernet bonds); that confuses the quorum, and bang.
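>
> (A quick way to watch this happen live, using standard PVE/corosync
> tooling, is to poll the quorum state on one node while the switches
> reboot:)
>
>   # run on one node while the switch stack restarts
>   watch -n 2 pvecm status        # watch the "Quorate:" line flip
>   corosync-quorumtool -s         # lower-level view of votes and quorum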
>
> We tried adding:
>
> pvenode config set --startall-onboot-delay 120
>
> on the two nodes, repeated the experiment (i.e., started the servers and
> rebooted the switches at the same time), and the trouble no longer
> triggers.
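>
> (To double-check the setting took effect, assuming the standard pvenode
> tooling, the node config can be dumped on each node:)
>
>   pvenode config get             # should list startall-onboot-delay: 120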
>
>
> Still, I'm asking for some feedback... particularly:
>
> 1) We were on PVE 6: has something changed in the quorum handling between
> PVE 6 and PVE 8? Before upgrading we never hit this...
>
> 2) Is there a better solution to this?
>
>
> Thanks.
>
--
dorsy