* [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds
@ 2025-07-25 14:01 Friedrich Weber
  2025-07-28 16:16 ` Hannes Duerr
  2025-07-30  8:59 ` [pve-devel] superseded: " Friedrich Weber
  0 siblings, 2 replies; 4+ messages in thread
From: Friedrich Weber @ 2025-07-25 14:01 UTC (permalink / raw)
  To: pve-devel

Testing has shown that running corosync (only) over a bond can be
problematic in some failure scenarios and for certain bond modes. The
documentation currently discourages bonds for corosync only on the grounds
that corosync can switch between available networks by itself, but it does
not mention other caveats of using bonds for corosync.

Hence, extend the documentation with recommendations and caveats
regarding bonds for corosync.

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---

Notes:
    Aaron suggested we could expose the bond-lacp-rate in the GUI to
    make it easier to change the setting on the PVE side. I'd open a
    feature report for this.
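    
    For reference, until such a GUI option exists, the rate can be set
    manually in /etc/network/interfaces. A minimal sketch (bond and NIC
    names are just examples, and the switch side needs a matching
    configuration as well):
    
        auto bond0
        iface bond0 inet manual
            bond-slaves eno1 eno2
            bond-miimon 100
            bond-mode 802.3ad
            bond-xmit-hash-policy layer2+3
            # fast: ask the link partner to send an LACPDU every second
            bond-lacp-rate fast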
    
    Changes since v2:
    - fix wording in the failure scenario description
    - explain that load-balancing bond modes are affected and why
    - clarify that the caveats apply whenever a bond is used for Corosync
      traffic (even if only as a redundant link)
    
    Changes since v1:
    - move to its own section under "Cluster Network"
    - reword remarks about bond-lacp-rate fast
    - reword remark under "Requirements"

 pve-network.adoc |  4 +++-
 pvecm.adoc       | 49 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/pve-network.adoc b/pve-network.adoc
index 2dec882..b361f97 100644
--- a/pve-network.adoc
+++ b/pve-network.adoc
@@ -495,7 +495,9 @@ use the active-backup mode.
 
 For the cluster network (Corosync) we recommend configuring it with multiple
 networks. Corosync does not need a bond for network redundancy as it can switch
-between networks by itself, if one becomes unusable.
+between networks by itself, if one becomes unusable. Some bond modes are known
+to be problematic for Corosync, see
+xref:pvecm_corosync_over_bonds[Corosync over Bonds].
 
 The following bond configuration can be used as distributed/shared
 storage network. The benefit would be that you get more speed and the
diff --git a/pvecm.adoc b/pvecm.adoc
index 312a26f..60afaaf 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -89,10 +89,8 @@ NOTE: To ensure reliable Corosync redundancy, it is essential to have at least
 another link on a different physical network. This enables Corosync to keep the
 cluster communication alive should the dedicated network be down.
 +
-NOTE: A single link backed by a bond is not enough to provide Corosync
-redundancy. When a bonded interface fails and Corosync cannot fall back to
-another link, it can lead to  asymmetric communication in the cluster, which in
-turn can lead to the cluster losing quorum.
+NOTE: A single link backed by a bond can be problematic in certain failure
+scenarios, see xref:pvecm_corosync_over_bonds[Corosync Over Bonds].
 
 * The root password of a cluster node is required for adding nodes.
 
@@ -606,6 +604,49 @@ transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf
 but keep in mind that this will disable all cryptography and redundancy support.
 This is therefore not recommended.
 
+[[pvecm_corosync_over_bonds]]
+Corosync Over Bonds
+~~~~~~~~~~~~~~~~~~~
+
+Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
+in certain failure scenarios. If one of the bonded interfaces fails and stops
+transmitting packets, but its link state stays up, and there are no other
+Corosync links available, some bond modes may cause a state of asymmetric
+connectivity where cluster nodes can only communicate with different subsets of
+other nodes. Affected are bond modes that provide load balancing, as these
+modes may still try to send out a subset of packets via the failed interface.
+In case of asymmetric connectivity, Corosync may not be able to form a stable
+quorum in the cluster. If this state persists and HA is enabled, nodes may
+fence themselves, even if their respective bond is still fully functioning. In
+the worst case, the whole cluster may fence itself.
+
+We recommend at least one dedicated physical NIC for the primary Corosync link,
+see xref:pvecm_cluster_requirements[Requirements]. Bonds may be used as
+additional links for increased redundancy. To avoid fencing in the failure
+scenario outlined above, the following caveats apply whenever a bond is used
+for Corosync traffic:
+
+* We *advise against* using bond modes *balance-rr*, *balance-xor*,
+  *balance-tlb*, or *balance-alb* for Corosync traffic. As explained above,
+  they can cause asymmetric connectivity in certain failure scenarios.
+
+* *IEEE 802.3ad (LACP)*: This bond mode can cause asymmetric connectivity in
+  certain failure scenarios as explained above, but it can recover from this
+  state, as each side of the bond (Proxmox VE node and switch) can stop using a
+  bonded interface if it has not received three LACPDUs in a row on it.
+  However, with default settings, LACPDUs are only sent every 30 seconds,
+  yielding a failover time of 90 seconds. This is too long, as nodes with HA
+  resources will fence themselves already after roughly one minute without a
+  stable quorum. If LACP bonds are used for Corosync traffic, we recommend
+  setting `bond-lacp-rate fast` *on the Proxmox VE node and the switch*!
+  Setting this option on one side requests the other side to send an LACPDU
+  every second. Setting this option on both sides can reduce the failover time
+  in the scenario above to 3 seconds and thus prevent fencing.
+
+* Bond mode *active-backup* will not cause asymmetric connectivity in the
+  failure scenario described above. The node whose bond experienced the failure
+  may lose connection to the cluster and, if HA is enabled, fence itself.
+
 Separate Cluster Network
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.47.2




* Re: [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds
  2025-07-25 14:01 [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds Friedrich Weber
@ 2025-07-28 16:16 ` Hannes Duerr
  2025-07-29 16:25   ` Friedrich Weber
  2025-07-30  8:59 ` [pve-devel] superseded: " Friedrich Weber
  1 sibling, 1 reply; 4+ messages in thread
From: Hannes Duerr @ 2025-07-28 16:16 UTC (permalink / raw)
  To: Proxmox VE development discussion, Friedrich Weber


On 7/25/25 4:03 PM, Friedrich Weber wrote:
> +Corosync Over Bonds
> +~~~~~~~~~~~~~~~~~~~
> +
> +Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
> +in certain failure scenarios. If one of the bonded interfaces fails and stops
> +transmitting packets, but its link state stays up, and there are no other
> +Corosync links available
I thought it can also occur if there are still other Corosync links available?
If I understand the next part correctly, you're even assuming it?
> , some bond modes may cause a state of asymmetric
> +connectivity where cluster nodes can only communicate with different subsets of
> +other nodes. Affected are bond modes that provide load balancing, as these
> +modes may still try to send out a subset of packets via the failed interface.
> +In case of asymmetric connectivity, Corosync may not be able to form a stable
> +quorum in the cluster.
--- here
> If this state persists and HA is enabled, nodes may
> +fence themselves, even if their respective bond is still fully functioning
---
> . In
> +the worst case, the whole cluster may fence itself.
> +
> +We recommend at least one dedicated physical NIC for the primary Corosync link,
> +see xref:pvecm_cluster_requirements[Requirements]. Bonds may be used as
> +additional links for increased redundancy. To avoid fencing in the failure
> +scenario outlined above, the following caveats apply whenever a bond is used
> +for Corosync traffic:
> +
> +* We *advise against* using bond modes *balance-rr*, *balance-xor*,
> +  *balance-tlb*, or *balance-alb* for Corosync traffic. As explained above,
> +  they can cause asymmetric connectivity in certain failure scenarios.
> +
> +* *IEEE 802.3ad (LACP)*: This bond mode can cause asymmetric connectivity in
> +  certain failure scenarios as explained above, but it can recover from this
> +  state, as each side of the bond (Proxmox VE node and switch) can stop using a
> +  bonded interface if it has not received three LACPDUs in a row on it.
> +  However, with default settings, LACPDUs are only sent every 30 seconds,
> +  yielding a failover time of 90 seconds. This is too long, as nodes with HA
> +  resources will fence themselves already after roughly one minute without a
> +  stable quorum. If LACP bonds are used for Corosync traffic, we recommend
> +  setting `bond-lacp-rate fast` *on the Proxmox VE node and the switch*!
> +  Setting this option on one side requests the other side to send an LACPDU
> +  every second. Setting this option on both sides can reduce the failover time
> +  in the scenario above to 3 seconds and thus prevent fencing.
> +
> +* Bond mode *active-backup* will not cause asymmetric connectivity in the
> +  failure scenario described above. The node whose bond experienced the failure
> +  may lose connection to the cluster and, if HA is enabled, fence itself.
> +
>   Separate Cluster Network
>   ~~~~~~~~~~~~~~~~~~~~~~~~
>   

* Re: [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds
  2025-07-28 16:16 ` Hannes Duerr
@ 2025-07-29 16:25   ` Friedrich Weber
  0 siblings, 0 replies; 4+ messages in thread
From: Friedrich Weber @ 2025-07-29 16:25 UTC (permalink / raw)
  To: Hannes Duerr, Proxmox VE development discussion

Thanks for taking a look!

Discussed with HD off-list:

- having the justification for the recommendations in the docs is good
- but since the justification is somewhat complex, it is probably not good
to have it right at the beginning of the new section.
- It might be better to have the recommendations first ("We recommend
[...]" plus the list of bond modes), and the justification below that,
so readers immediately see the important part, and can optionally still
read about the justification.

Hence I'll send a v4 that rearranges the paragraphs.

On 28/07/2025 18:16, Hannes Duerr wrote:
> 
> On 7/25/25 4:03 PM, Friedrich Weber wrote:
>> +Corosync Over Bonds
>> +~~~~~~~~~~~~~~~~~~~
>> +
>> +Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
>> +in certain failure scenarios. If one of the bonded interfaces fails and stops
>> +transmitting packets, but its link state stays up, and there are no other
>> +Corosync links available
> I thought it can also occur if there are still other Corosync links available?

In my tests so far, it didn't. Even if the bond is the primary corosync
link, as long as there is still a fallback link available, corosync
seems to simply switch over to the fallback link.

Here 172.16.0.0/24 is the LACP-bonded network, and I stopped traffic on
one bonded NIC of node 2. corosync just says:

On node 1:

Jul 29 11:31:39 pve1 corosync[841]:   [KNET  ] link: host: 2 link: 0 is down
Jul 29 11:31:39 pve1 corosync[841]:   [KNET  ] host: host: 2 (passive)
best link: 1 (pri: 1)
Jul 29 11:31:39 pve1 corosync[841]:   [KNET  ] host: host: 2 (passive)
best link: 1 (pri: 1)

On node 2:

Jul 29 11:31:39 pve2 corosync[837]:   [KNET  ] link: host: 4 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]:   [KNET  ] link: host: 1 link: 0 is down
Jul 29 11:31:39 pve2 corosync[837]:   [KNET  ] host: host: 4 (passive)
best link: 1 (pri: 1)
Jul 29 11:31:39 pve2 corosync[837]:   [KNET  ] host: host: 1 (passive)
best link: 1 (pri: 1)

And nothing on the other nodes.
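
In case you want to reproduce something similar: one way to simulate a NIC
that stops transmitting while its link state stays up is to drop all egress
traffic on one of the bond slaves, roughly:

  tc qdisc add dev eno2 root netem loss 100%  # eno2 stands for the affected slave NIC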

corosync-cfgtool reports the following on the four nodes; note that only
the 1<->2 link is not "connected":

Local node ID 1, transport knet
nodeid: 2 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.102) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.103) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.0.101->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.101->192.168.0.104) enabled connected mtu: 1397

Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.101) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.101) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.103) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.0.102->172.16.0.104) enabled mtu: 1397
   LINK: 1 udp (192.168.0.102->192.168.0.104) enabled connected mtu: 1397

Local node ID 3, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.101) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.102) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.102) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (172.16.0.103->172.16.0.104) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.103->192.168.0.104) enabled connected mtu: 1397

Local node ID 4, transport knet
nodeid: 1 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.101) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.101) enabled connected mtu: 1397

nodeid: 2 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.102) enabled mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.102) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (172.16.0.104->172.16.0.103) enabled connected mtu: 1397
   LINK: 1 udp (192.168.0.104->192.168.0.103) enabled connected mtu: 1397

With `bond-lacp-rate slow`, this switches over to "connected" for all four
interfaces after ~90 seconds.
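
For reference, whether the fast or slow rate is active on the node side can
be checked via the bonding proc interface, e.g. (bond0 is just an example
name):

  grep -i 'lacp rate' /proc/net/bonding/bond0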



* Re: [pve-devel] superseded: [PATCH docs v3] pvecm, network: add section on corosync over bonds
  2025-07-25 14:01 [pve-devel] [PATCH docs v3] pvecm, network: add section on corosync over bonds Friedrich Weber
  2025-07-28 16:16 ` Hannes Duerr
@ 2025-07-30  8:59 ` Friedrich Weber
  1 sibling, 0 replies; 4+ messages in thread
From: Friedrich Weber @ 2025-07-30  8:59 UTC (permalink / raw)
  To: pve-devel

Superseded by:
https://lore.proxmox.com/pve-devel/20250730085836.147270-1-f.weber@proxmox.com/

