From: dorsy via pve-user <pve-user@lists.proxmox.com>
To: pve-user@lists.proxmox.com
Cc: dorsy <dorsyka@yahoo.com>
Subject: Re: [PVE-User] Quorum bouncing, VM does not start...
Date: Wed, 20 Aug 2025 00:53:45 +0200
Message-ID: <mailman.179.1755644682.385.pve-user@lists.proxmox.com>
In-Reply-To: <4lgenl-oaf2.ln1@leia.lilliput.linux.it>
I'd suggest a direct link between the hosts for another quorum ring, if
you have a spare network port.
Multiple rings can also be more resilient than MLAG. But that is only
my 2 cents.
see: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
and:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds
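For example, a minimal sketch of what the extra link looks like in
/etc/pve/corosync.conf (the second node name, the cluster name and all
addresses here are placeholders, not your config; remember to bump
config_version when editing, as the docs describe):

  totem {
    cluster_name: yourcluster
    config_version: 5
    interface {
      linknumber: 0
    }
    interface {
      linknumber: 1
    }
    ...
  }

  nodelist {
    node {
      name: pdpve1
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 192.168.1.11   # existing path over the switch stack
      ring1_addr: 10.10.10.1     # direct cable between the hosts
    }
    node {
      name: pdpve2
      nodeid: 2
      quorum_votes: 1
      ring0_addr: 192.168.1.12
      ring1_addr: 10.10.10.2
    }
  }

The direct cable keeps a corosync path alive while the switch stack
reboots and re-forms its bonds.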
On 8/19/2025 5:08 PM, Marco Gaiarin wrote:
> We have pairs of servers in some local branches of our organization,
> clustered but deliberately not in failover (or 'automatic failover');
> this is intended.
>
> Most of these branch offices close for the summer holidays, when power
> outages flourish. ;-)
> Rather frequently the whole site gets powered off; the UPSes do their job
> but sooner or later shut down the servers (and all other equipment) until
> some local employee goes to the site and powers everything back up.
>
> The servers are set up with two UPSes (one per server); the UPSes also
> power a stack of two Catalyst 2960S switches (again, one UPS per switch).
> Every server interface is part of a trunk/bond, with one cable on switch1
> and one cable on switch2 of the stack.
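>
> (As an illustration only: the interface names, addresses and bond mode
> below are assumptions, not our exact config. Such a bond in
> /etc/network/interfaces looks roughly like this:)
>
>   auto bond0
>   iface bond0 inet manual
>       bond-slaves eno1 eno2        # one NIC to each switch of the stack
>       bond-miimon 100
>       bond-mode active-backup      # or 802.3ad if the stack runs LACP
>
>   auto vmbr0
>   iface vmbr0 inet static
>       address 192.168.1.11/24
>       gateway 192.168.1.1
>       bridge-ports bond0
>       bridge-stp off
>       bridge-fd 0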
>
>
> We recently upgraded to PVE 8 and found that when the whole site gets
> powered off, sometimes (but with a decent frequency) only some of the VMs
> get powered on again.
>
>
> Digging for the culprit, we found:
>
> 2025-08-07T10:49:19.997751+02:00 pdpve1 systemd[1]: Starting pve-guests.service - PVE guests...
> 2025-08-07T10:49:20.792333+02:00 pdpve1 pve-guests[2392]: <root@pam> starting task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam:
> 2025-08-07T10:49:20.794446+02:00 pdpve1 pvesh[2392]: waiting for quorum ...
> 2025-08-07T10:52:18.584607+02:00 pdpve1 pmxcfs[2021]: [status] notice: node has quorum
> 2025-08-07T10:52:18.879944+02:00 pdpve1 pvesh[2392]: got quorum
> 2025-08-07T10:52:18.891461+02:00 pdpve1 pve-guests[2393]: <root@pam> starting task UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
> 2025-08-07T10:52:18.891653+02:00 pdpve1 pve-guests[2950]: start VM 100: UPID:pdpve1:00000B86:00005711:68946942:qmstart:100:root@pam:
> 2025-08-07T10:52:20.103473+02:00 pdpve1 pve-guests[2950]: VM 100 started with PID 2960.
>
> So the servers restart, get quorum, and start VMs in order; but then they
> suddenly lose quorum:
>
> 2025-08-07T10:53:16.128336+02:00 pdpve1 pmxcfs[2021]: [status] notice: node lost quorum
> 2025-08-07T10:53:20.901367+02:00 pdpve1 pve-guests[2393]: cluster not ready - no quorum?
> 2025-08-07T10:53:20.903743+02:00 pdpve1 pvesh[2392]: cluster not ready - no quorum?
> 2025-08-07T10:53:20.905349+02:00 pdpve1 pve-guests[2392]: <root@pam> end task UPID:pdpve1:00000959:0000117F:68946890:startall::root@pam: cluster not ready - no quorum?
> 2025-08-07T10:53:20.922275+02:00 pdpve1 systemd[1]: Finished pve-guests.service - PVE guests.
>
> and the subsequent VMs do not start; after some seconds quorum comes back
> and all goes back to normal, but the VMs have to be started by hand.
>
>
> Clearly, if we reboot or power off the two servers with the switches
> still powered on, everything works as expected.
> When we power on the servers and reboot the switches at the same time,
> the trouble is triggered.
>
>
> So it seems quorum is lost because the switches stop forwarding traffic
> for a while during startup (e.g., binding the second unit into the stack
> and bringing up the ethernet bonds); that confuses the quorum, and bang.
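>
> (A quick way to watch this happen live, using standard PVE/corosync
> tooling, is to poll the quorum state on one node while the switches
> reboot:)
>
>   # run on one node while the switch stack restarts
>   watch -n 2 pvecm status        # watch the "Quorate:" line flip
>   corosync-quorumtool -s         # lower-level view of votes and quorum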
>
> We tried adding:
>
> pvenode config set --startall-onboot-delay 120
>
> on the two nodes, repeated the experiment (i.e., started the servers and
> rebooted the switches at the same time), and the trouble no longer
> triggers.
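>
> (To double-check the setting took effect, assuming the standard pvenode
> tooling, the node config can be dumped on each node:)
>
>   pvenode config get             # should list startall-onboot-delay: 120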
>
>
> Still, I'm asking for some feedback... particularly:
>
> 1) We were on PVE 6: has something changed in the quorum handling between
> PVE 6 and PVE 8? Before upgrading we never hit this...
>
> 2) Is there a better solution to this?
>
>
> Thanks.
>
--
dorsy