From: Friedrich Weber <f.weber@proxmox.com>
To: Hannes Duerr <h.duerr@proxmox.com>,
Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds
Date: Tue, 29 Jul 2025 18:25:59 +0200
Message-ID: <3f12167e-892a-4945-a43d-cab642464e2d@proxmox.com>
In-Reply-To: <86841dab-96d5-403d-a91f-a69d17b716a7@proxmox.com>
Thanks for taking a look!
Discussed with HD off-list:
- having the justification for the recommendations in the docs is good
- but since the justification is somewhat complex, it is probably not
good to have it right at the beginning of the new section.
- It might be better to have the recommendations first ("We recommend
[...]" plus the list of bond modes), and the justification below that,
so readers immediately see the important part, and can optionally still
read about the justification.
Hence I'll send a new version that rearranges the paragraphs.
On 28/07/2025 18:16, Hannes Duerr wrote:
>
> On 7/25/25 4:03 PM, Friedrich Weber wrote:
>> +Corosync Over Bonds
>> +~~~~~~~~~~~~~~~~~~~
>> +
>> +Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
>> +in certain failure scenarios. If one of the bonded interfaces fails and stops
>> +transmitting packets, but its link state stays up, and there are no other
>> +Corosync links available
> I thought it can also occur if there are still other Corosync links available?
In my tests so far, it didn't. Even if the bond is the primary corosync
link, as long as there is still a fallback link available, corosync
seems to simply switch over to the fallback link.
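For reference, this is roughly what the two-link setup looks like in
corosync.conf (cluster name, node names and the priority values are
illustrative assumptions; the quorum section is omitted):

    totem {
      version: 2
      cluster_name: testcluster
      transport: knet
      # link 0: the LACP-bonded 172.16.0.0/24 network, preferred while up
      interface {
        linknumber: 0
        knet_link_priority: 2
      }
      # link 1: the 192.168.0.0/24 fallback network
      interface {
        linknumber: 1
        knet_link_priority: 1
      }
    }

    nodelist {
      node {
        name: pve1
        nodeid: 1
        ring0_addr: 172.16.0.101
        ring1_addr: 192.168.0.101
      }
      # nodes 2-4 analogous
    }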
Here 172.16.0.0/24 is the LACP-bonded network, and I stopped traffic on
one bonded NIC of node 2.
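(In case someone wants to reproduce this: one way to stop traffic on a
bond member without changing its link state is a 100% egress drop via
netem; the interface name is a placeholder for the bond member NIC.)

    # on node 2: drop all outgoing packets on one bond member,
    # the carrier/link state stays up
    tc qdisc add dev eth1 root netem loss 100%
    # to undo:
    tc qdisc del dev eth1 root

With the traffic stopped, corosync just says: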
On node 1:
Jul 29 11:31:39 pve1 corosync[841]: [KNET ] link: host: 2 link: 0 is down
Jul 29 11:31:39 pve1 corosync[841]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul 29 11:31:39 pve1 corosync[841]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
On node 2:
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] link: host: 4 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] link: host: 1 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jul 29 11:31:39 pve2 corosync[837]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
And nothing on the other nodes.
corosync-cfgtool -n reports the following on the four nodes; note that
only the 1<->2 and 2<->4 pairs are not "connected" on link 0:
Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.102) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.103) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.104) enabled connected mtu: 1397

Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.101) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.101) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.103) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.104) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.104) enabled connected mtu: 1397

Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.101) enabled connected mtu: 1397
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.102) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.102) enabled connected mtu: 1397
nodeid: 4 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.104) enabled connected mtu: 1397

Local node ID 4, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.101) enabled connected mtu: 1397
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.102) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.103) enabled connected mtu: 1397
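(Side note: the member state can also be checked from the bond side,
independently of corosync; bond0 is a placeholder for the actual bond
name.)

    # shows per-member MII status and LACP actor/partner details
    cat /proc/net/bonding/bond0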
With `bond-lacp-rate slow`, all four affected links switch back to
"connected" after ~90 seconds: at the slow rate, LACPDUs are sent every
30 seconds and a peer is declared dead after three missed ones, so it
takes up to 90 seconds until LACP notices the failure and the bond fails
over to the healthy NIC.
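For comparison, this is roughly what a bond stanza with the fast rate
would look like in /etc/network/interfaces (interface names are
placeholders); it brings LACP failure detection down to ~3 seconds:

    auto bond0
    iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-miimon 100
        # LACPDU every second, peer declared dead after 3 missed ones
        bond-lacp-rate fast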