From: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
To: "pve-user@lists.proxmox.com" <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Corosync and Cluster reboot
Date: Tue, 7 Jan 2025 14:15:05 +0000
Message-ID: <061153a5c032dd89e04d7e3ef54b8fbcdce5fb24.camel@groupe-cyllene.com>
In-Reply-To: <CAOKSTBuFw1ihaCA7AF_iDHaSbHJXHREGLVmdPPuFEkR9L3Zjsg@mail.gmail.com>
Personally, I'd recommend temporarily disabling HA during the network change (move /etc/pve/ha/resources.cfg to a temporary directory, stop all pve-ha-lrm services, then stop all pve-ha-crm services to stop the watchdog).
Then, after the migration, watch the corosync logs for 1 or 2 days and, if no retransmits occur, re-enable HA.
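A rough sketch of that sequence, assuming the stock Proxmox service names and paths (the /root destination is just an example, any location outside /etc/pve works):

  # on one node: park the HA resource definitions outside the clustered filesystem
  mv /etc/pve/ha/resources.cfg /root/resources.cfg.disabled

  # on every node: stop the local resource manager first...
  systemctl stop pve-ha-lrm
  # ...and once all LRMs are stopped, stop the CRM so the watchdog is released
  systemctl stop pve-ha-crm

  # after the maintenance window, once the corosync logs have been quiet:
  systemctl start pve-ha-crm
  systemctl start pve-ha-lrm
  mv /root/resources.cfg.disabled /etc/pve/ha/resources.cfg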
It's quite possible that it's a corosync bug (I remember having this kind of error with PVE 7.x).
Also, for "big" clusters (20-30 nodes), I'm using sctp protocol now, instead udp. for me , it's a lot more reliable when you have a network saturation on 1 now.
(I had the case of interne udp flood attack coming from outside on 1 on my node, lagging the whole corosync cluster).²
corosync.conf:
totem {
  cluster_name: ....
  ....
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  ....
(This needs a full restart of corosync everywhere, and HA needs to be disabled beforehand, because UDP nodes can't communicate with SCTP nodes, so you'll have a loss of quorum during the change.)
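One possible way to roll that out, sketched under the assumption that HA is already disabled and the edited corosync.conf has already reached every node (the node names are placeholders):

  # UDP and SCTP nodes cannot talk to each other, so restart corosync on
  # all nodes in quick succession to keep the quorum outage short
  for n in node01 node02 node03; do
      ssh "$n" 'systemctl restart corosync'
  done

  # afterwards, verify link status and membership
  corosync-cfgtool -s
  pvecm status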
-------- Original message --------
From: Gilberto Ferreira <gilberto.nunes32@gmail.com>
Reply-To: Proxmox VE user list <pve-user@lists.proxmox.com>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Corosync and Cluster reboot
Date: 07/01/2025 13:33:41
Just to clarify, I had a similar issue in a low-latency network with a 12-node cluster, all with 1G ethernet cards.
After adding this token_retransmit to corosync.conf, no more problems.
Perhaps that could help you.
On Tue, 7 Jan 2025 at 09:01, Gilberto Ferreira <gilberto.nunes32@gmail.com> wrote:
Try adding this to corosync.conf on one of the nodes: token_retransmit: 200
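For reference, token_retransmit is a totem option, so the fragment would look roughly like this (the surrounding values are elided, as in the config above; corosync normally derives this timeout from the token timeout, so only set it explicitly if you have a reason to):

  totem {
    ....
    token_retransmit: 200
    ....
  }

On Proxmox the clustered copy of the file lives at /etc/pve/corosync.conf, and the config_version in it should be bumped when editing so the change propagates.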
On Tue, 7 Jan 2025 at 08:24, Iztok Gregori <iztok.gregori@elettra.eu> wrote:
Hi to all!
I need some help to understand a situation (a cluster reboot) which happened to us last week. We are running a 17-node Proxmox cluster with a separate Ceph cluster for storage (no hyper-convergence).
We had to upgrade a stack of 2 switches, and in order to avoid any downtime we decided to prepare a new (temporary) stack and move the links from one switch to the other. Our procedure was the following:
- Migrate all the VMs off the node.
- Unplug the links from the old switch.
- Plug the links into the temporary switch.
- Wait until the node is available again in the cluster (see the check sketched below).
- Repeat.
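For the "wait until available" step, a quick way to see that corosync has re-formed membership with the moved node (a generic check, not necessarily the one used here):

  pvecm status                   # quorum information and current member list
  corosync-cfgtool -s            # knet link status as seen from the local node
  journalctl -u corosync -n 20   # look for the "Sync joined" / "Members[...]" lines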
We had to move 8 nodes from one switch to the other. The first 4 nodes went smoothly, but when we plugged the 5th node into the new switch ALL the nodes which had HA VMs configured rebooted!
From the Corosync logs I can see that the token wasn't received, and because of that watchdog-mux wasn't updated, causing the node reboots.
Here are the Corosync logs during the procedure and before the nodes restarted. They were captured from a node which didn't reboot (pve-ha-lrm: idle):
12:51:57 [KNET ] link: host: 18 link: 0 is down
12:51:57 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
12:51:57 [KNET ] host: host: 18 has no active links
12:52:02 [TOTEM ] Token has not been received in 9562 ms
12:52:16 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
12:52:16 [QUORUM] Sync left[1]: 18
12:52:16 [TOTEM ] A new membership (1.d29) was formed. Members left: 18
12:52:16 [TOTEM ] Failed to receive the leave message. failed: 18
12:52:16 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
12:52:16 [MAIN ] Completed service synchronization, ready to provide service.
12:52:42 [KNET ] rx: host: 18 link: 0 is up
12:52:42 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
12:52:50 [TOTEM ] Token has not been received in 9567 ms
12:53:01 [TOTEM ] Token has not been received in 20324 ms
12:53:11 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
12:53:11 [TOTEM ] A new membership (1.d35) was formed. Members
12:53:20 [TOTEM ] Token has not been received in 9570 ms
12:53:31 [TOTEM ] Token has not been received in 20326 ms
12:53:41 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
12:53:41 [TOTEM ] A new membership (1.d41) was formed. Members
12:53:50 [TOTEM ] Token has not been received in 9570 ms
And here you can find the logs of a successfully completed "procedure":
12:19:12 [KNET ] link: host: 19 link: 0 is down
12:19:12 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
12:19:12 [KNET ] host: host: 19 has no active links
12:19:17 [TOTEM ] Token has not been received in 9562 ms
12:19:31 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
12:19:31 [QUORUM] Sync left[1]: 19
12:19:31 [TOTEM ] A new membership (1.d21) was formed. Members left: 19
12:19:31 [TOTEM ] Failed to receive the leave message. failed: 19
12:19:31 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
12:19:31 [MAIN ] Completed service synchronization, ready to provide service.
12:19:47 [KNET ] rx: host: 19 link: 0 is up
12:19:47 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
12:19:50 [QUORUM] Sync members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
12:19:50 [QUORUM] Sync joined[1]: 19
12:19:50 [TOTEM ] A new membership (1.d25) was formed. Members joined: 19
12:19:51 [QUORUM] Members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
12:19:51 [MAIN ] Completed service synchronization, ready to provide service.
Comparing the 2 logs I can see that after the "host: 18" link was found active again the token was still not received, but I cannot figure out what was different in this case.
I have 2 possible culprits:
1. NETWORK
The cluster network is backed by 5 Extreme Networks switches: 3 stacks of two x870 (100GbE), 1 stack of two x770 (40GbE), and one temporary stack of two 7720-32C (100GbE). The switches are linked together by a 2x LACP bond, and 99% of the cluster communication is on 100GbE.
The hosts are connected to the network with interfaces of different speeds: 10GbE (1 node), 25GbE (4 nodes), 40GbE (1 node), 100GbE (11 nodes). All the nodes are bonded; the Corosync network (which is the same as the management one) is defined on a bridge interface on top of the bonded link (the configuration is almost the same on all nodes; some older ones use balance-xor and the others use LACP as the bonding mode).
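For illustration, the kind of /etc/network/interfaces layout being described, with placeholder NIC names and addresses (not the actual configuration of these nodes):

  auto bond0
  iface bond0 inet manual
      bond-slaves eno1 eno2          # placeholder NIC names
      bond-mode 802.3ad              # balance-xor on the older nodes
      bond-miimon 100

  auto vmbr0
  iface vmbr0 inet static
      address 192.0.2.10/24          # example management/corosync address
      gateway 192.0.2.1
      bridge-ports bond0
      bridge-stp off
      bridge-fd 0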
It is possible that there is something wrong with the network, but I cannot find a probable cause. From the data that I have, I don't see anything special: no links were saturated, no errors logged...
2. COROSYNC
The cluster is running an OLD version of Proxmox (7.1-12) with Corosync 3.1.5-pve2. It is possible that there is a problem in Corosync that was fixed in a later release; I did a quick search but I didn't find anything. The cluster upgrade is on my to-do list (but the list is huge, so it will not be done tomorrow).
We are running only one Corosync network, which is the same as the management/migration one, but different from the one for client/storage/backup. The configuration is very basic, I think it is the default one; I can provide it if needed.
I checked the Corosync stats and the average latency is around 150 (microseconds?) across all links on all nodes.
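In case it is useful to someone reproducing the check, with corosync 3 the per-link knet statistics (including the average latency) can be read from the stats map:

  corosync-cmapctl -m stats | grep latency_ave

The keys look like stats.knet.nodeX.linkY.latency_ave, with one entry per peer node and link.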
====
In general it could be a combination of the 2 above or something completely different.
Do you have some advice on where to look to debug further?
I can provide more information if needed.
Thanks a lot!
Iztok
--
Iztok Gregori
ICT Systems and Services
Elettra - Sincrotrone Trieste S.C.p.A.
Telephone: +39 040 3758948
http://www.elettra.eu
_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
Thread overview: 9+ messages
2025-01-07 11:06 Iztok Gregori
2025-01-07 12:01 ` Gilberto Ferreira
2025-01-07 12:33 ` Gilberto Ferreira
2025-01-07 14:06 ` Iztok Gregori
2025-01-07 14:17 ` Gilberto Ferreira
2025-01-07 14:15 ` DERUMIER, Alexandre [this message]
2025-01-08 10:12 ` Iztok Gregori
2025-01-08 12:02 ` Alwin Antreich via pve-user
2025-01-08 12:53 ` proxmox