* [PVE-User] Quorum bouncing, VM does not start...
From: Marco Gaiarin
Date: 2025-08-19 15:08 UTC
To: pve-user
We have a few pairs of servers at some local branches of our organization,
joined in a cluster but deliberately without failover (or 'automatic
failover'); this is intended.
Most of these branch offices close for the summer holidays, when power
outages flourish. ;-)
Rather frequently the whole site gets powered off: the UPSes do their job,
but sooner or later they shut down the servers (and all other equipment)
until some local employee goes to the site and powers everything back up.
The servers are backed by two UPSes (one per server); the UPSes also power a
stack of two Catalyst 2960S switches (again, one UPS per switch). Every
server interface is part of a trunk/bond, with one cable on switch1 and one
cable on switch2 in the stack.
We recently upgraded to PVE 8 and found that if the whole site gets powered
off, sometimes (but with a decent frequency) only some of the VMs get powered
back on.
Digging for the culprit, we found:
2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: <root@pam> starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: <root@pam> starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.
So the servers restart, get quorum and start the VMs in order; but then quorum is suddenly lost:
2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: <root@pam> end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.
and the subsequent VMs do not start; after a few seconds quorum comes back and
everything returns to normal, but the remaining VMs have to be started by hand.
Clearly, if we reboot or power off the two servers while the switches stay
powered on, everything works as expected.
We managed to power on the servers and reboot the switch stack at the same
time, and the trouble gets triggered.
So it seems that quorum is lost, probably because the switches stop forwarding
for some time while doing their thing (e.g. binding the second unit into the
stack and bringing up the Ethernet bonds); that confuses the quorum, and bang.
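
For reference, a minimal sketch of how we checked the corosync side of this
(assuming the boot in question is still in the journal):

  # quorum and cluster state as seen by this node
  pvecm status

  # per-link connectivity as seen by knet
  corosync-cfgtool -s

  # look for link down/up and token timeout events around the boot
  journalctl -b -u corosync -u pve-cluster | grep -Ei 'link|token|quorum'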
We tried adding:

pvenode config set --startall-onboot-delay 120

on the two nodes, repeated the experiment (i.e. start the servers and reboot
the switch stack), and the trouble does not trigger anymore.
Still, I'm asking for some feedback... particularly:
1) We were on PVE 6: has anything changed in the quorum handling between PVE 6
and PVE 8? Because before upgrading we never hit this...
2) Is there a better solution to this?
Thanks.
* Re: [PVE-User] Quorum bouncing, VM does not start...
From: dorsy <dorsyka@yahoo.com>
Date: 2025-08-19 22:53 UTC
To: pve-user
I'd suggest a direct link between the hosts as another corosync ring, if you
have a spare network port.
Also, multiple rings could be more resilient than MLAG. But that is only my
2 cents.
see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
and:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds
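
As a rough sketch of what the redundancy chapter describes: a second link is
added in /etc/pve/corosync.conf by giving each node an extra ringX_addr (the
addresses and the second node name below are only placeholders for a
back-to-back cable):

  nodelist {
    node {
      name: pdpve1
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 192.168.1.1    # placeholder: existing address on the bonded LAN
      ring1_addr: 10.10.10.1     # placeholder: new direct host-to-host link
    }
    node {
      name: pdpve2               # placeholder second node name
      nodeid: 2
      quorum_votes: 1
      ring0_addr: 192.168.1.2
      ring1_addr: 10.10.10.2
    }
  }
  # remember to bump config_version in the totem section when editing the file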
On 8/19/2025 5:08 PM, Marco Gaiarin wrote:
> [...]
>
> 1) We were on PVE 6: has anything changed in the quorum handling between PVE 6
> and PVE 8? Because before upgrading we never hit this...
>
> 2) Is there a better solution to this?
--
dorsy
* Re: [PVE-User] Quorum bouncing, VM does not start...
From: Marco Gaiarin
Date: 2025-08-25 9:04 UTC
To: dorsy via pve-user; +Cc: pve-user
Hello, dorsy via pve-user!
On that day you wrote...
> I'd suggest a direct link between the hosts as another corosync ring, if
> you have a spare network port.
> Also, multiple rings could be more resilient than MLAG. But that is only
> my 2 cents.
> see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
> and:
> https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds
Thanks for the links. It seems to me that 'LACP fast' is also a pretty decent
solution for this!
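
For the record, with the ifupdown2-style config PVE uses this would be
something like the sketch below (interface names and the address are
placeholders); the switch ports have to be set to the fast LACP rate as well,
if the platform supports it:

  auto bond0
  iface bond0 inet manual
      bond-slaves eno1 eno2
      bond-mode 802.3ad
      bond-xmit-hash-policy layer2+3
      bond-lacp-rate 1           # 1 = fast (LACPDUs every second), 0 = slow (every 30 s)

  auto vmbr0
  iface vmbr0 inet static
      address 192.168.1.1/24     # placeholder
      bridge-ports bond0
      bridge-stp off
      bridge-fd 0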