From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
    by lore.proxmox.com (Postfix) with ESMTPS id 2FBB61FF183
    for ; Wed, 30 Jul 2025 10:57:50 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
    by firstgate.proxmox.com (Proxmox) with ESMTP id E1CDC36E79;
    Wed, 30 Jul 2025 10:59:13 +0200 (CEST)
From: Friedrich Weber
To: pve-devel@lists.proxmox.com
Date: Wed, 30 Jul 2025 10:58:08 +0200
Message-ID: <20250730085836.147270-1-f.weber@proxmox.com>
X-Mailer: git-send-email 2.47.2
MIME-Version: 1.0
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1753865909878
X-SPAM-LEVEL: Spam detection results: 0
    AWL 0.011 Adjusted score from AWL reputation of From: address
    BAYES_00 -1.9 Bayes spam probability is 0 to 1%
    DMARC_MISSING 0.1 Missing DMARC policy
    KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
    SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record
    SPF_PASS -0.001 SPF: sender matches SPF record
Subject: [pve-devel] [PATCH docs v4] pvecm, network: add section on corosync over bonds
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: Proxmox VE development discussion
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel"

Testing has shown that running corosync (only) over a bond can be
problematic in some failure scenarios and for certain bond modes. The
documentation only discourages bonds for corosync because corosync can
switch between available networks itself, but does not mention other
caveats when using bonds for corosync. Hence, extend the documentation
with recommendations and caveats regarding bonds for corosync.

Signed-off-by: Friedrich Weber
---

Notes:
    Aaron suggested we could expose the bond-lacp-rate in the GUI to make
    it easier to change the setting on the PVE side. I'd open a feature
    report for this.

    Changes since v3:
    - describe recommendations first, and further details for interested
      readers below. Consequently, rephrase failure scenario description
      (thx HD!)

    Changes since v2:
    - fix wording in the failure scenario description
    - explain that load-balancing bond modes are affected and why
    - clarify that the caveats apply whenever a bond is used for Corosync
      traffic (even if only as a redundant link)

    Changes since v1:
    - move to its own section under "Cluster Network"
    - reword remarks about bond-lacp-rate fast
    - reword remark under "Requirements"

 pve-network.adoc |  4 ++-
 pvecm.adoc       | 68 +++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/pve-network.adoc b/pve-network.adoc
index 2dec882..b361f97 100644
--- a/pve-network.adoc
+++ b/pve-network.adoc
@@ -495,7 +495,9 @@ use the active-backup mode.
 
 For the cluster network (Corosync) we recommend configuring it with multiple
 networks. Corosync does not need a bond for network redundancy as it can switch
-between networks by itself, if one becomes unusable.
+between networks by itself, if one becomes unusable. Some bond modes are known
+to be problematic for Corosync, see
+xref:pvecm_corosync_over_bonds[Corosync Over Bonds].
 
 The following bond configuration can be used as distributed/shared storage
 network. The benefit would be that you get more speed and the
diff --git a/pvecm.adoc b/pvecm.adoc
index 312a26f..3af1a06 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -89,10 +89,8 @@ NOTE: To ensure reliable Corosync redundancy, it is essential to have at least
 another link on a different physical network. This enables Corosync to keep
 the cluster communication alive should the dedicated network be down.
 +
-NOTE: A single link backed by a bond is not enough to provide Corosync
-redundancy. When a bonded interface fails and Corosync cannot fall back to
-another link, it can lead to asymmetric communication in the cluster, which in
-turn can lead to the cluster losing quorum.
+NOTE: A single link backed by a bond can be problematic in certain failure
+scenarios, see xref:pvecm_corosync_over_bonds[Corosync Over Bonds].
 
 * The root password of a cluster node is required for adding nodes.
 
@@ -606,6 +604,68 @@ transport to `udp` or `udpu` in your xref:pvecm_edit_corosync_conf[corosync.conf
 but keep in mind that this will disable all cryptography and redundancy support.
 This is therefore not recommended.
 
+[[pvecm_corosync_over_bonds]]
+Corosync Over Bonds
+~~~~~~~~~~~~~~~~~~~
+
+Recommendations
+^^^^^^^^^^^^^^^
+
+We recommend at least one dedicated physical NIC for the primary Corosync link,
+see xref:pvecm_cluster_requirements[Requirements].
+xref:sysadmin_network_bond[Bonds] may be used as additional links for increased
+redundancy. The following caveats apply *whenever a bond is used for Corosync
+traffic*:
+
+* Bond mode *active-backup* may not provide the expected redundancy in certain
+  failure scenarios, see below for details.
+
+* We *advise against* using bond modes *balance-rr*, *balance-xor*,
+  *balance-tlb*, or *balance-alb* for Corosync traffic. They are known to be
+  problematic in certain failure scenarios, see below for details.
+
+* *IEEE 802.3ad (LACP)*: If LACP bonds are used for Corosync traffic, we
+  strongly recommend setting `bond-lacp-rate fast` *on the Proxmox VE node and
+  the switch*! With the default setting `bond-lacp-rate slow`, this mode is
+  known to be problematic in certain failure scenarios, see below for details.
+
+Background
+^^^^^^^^^^
+
+Using a xref:sysadmin_network_bond[bond] as a Corosync link can be problematic
+in certain failure scenarios. Consider the failure scenario where one of the
+bonded interfaces fails and stops transmitting packets, but its link state
+stays up, and there are no other Corosync links available. In this scenario,
+some bond modes may cause a state of asymmetric connectivity where cluster
+nodes can only communicate with different subsets of other nodes. Affected are
+bond modes that provide load balancing, as these modes may still try to send
+out a subset of packets via the failed interface. In case of asymmetric
+connectivity, Corosync may not be able to form a stable quorum in the cluster.
+If this state persists and HA is enabled, even nodes whose bond does not have
+any issues may fence themselves. In the worst case, the whole cluster may fence
+itself.
+
+The bond mode *active-backup* will not cause asymmetric connectivity in the
+failure scenario described above. However, the bond with the interface failure
+may not switch over to the backup link. The node may lose connection to the
+cluster and, if HA is enabled, fence itself.
+
+Bond modes *balance-rr*, *balance-xor*, *balance-tlb*, or *balance-alb* may
+cause asymmetric connectivity in the failure scenario above, which can lead to
+unexpected fencing if HA is enabled.
+
+Bond mode *IEEE 802.3ad (LACP)* can cause asymmetric connectivity in the
+failure scenario above, but it can recover from this state, as each side of the
+bond (Proxmox VE node and switch) can stop using a bonded interface if it has
+not received three LACPDUs in a row on it. However, with default settings,
+LACPDUs are only sent every 30 seconds, yielding a failover time of 90 seconds.
+This is too long, as nodes with HA resources will fence themselves already
+after roughly one minute without a stable quorum. If LACP bonds are used for
+Corosync traffic, we recommend setting `bond-lacp-rate fast` on the Proxmox VE
+node and the switch! Setting this option on one side requests the other side to
+send an LACPDU every second. Setting this option on both sides can reduce the
+failover time in the scenario above to 3 seconds and thus prevent fencing.
+
 Separate Cluster Network
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.47.2


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
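
For readers who want to see what the LACP recommendation from the patch might look like on the Proxmox VE side, here is a minimal sketch of a bond stanza for /etc/network/interfaces. It is not part of the patch. The bond name `bond0`, the member interfaces `eno1` and `eno2`, and the `bond-miimon` and `bond-xmit-hash-policy` values are assumptions chosen for illustration. Only `bond-lacp-rate fast` comes from the text above.

```
# Hypothetical LACP bond carrying an additional (redundant) Corosync link.
# Interface names are placeholders; adapt them to the actual hardware.
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
    # Ask the peer to send an LACPDU every second instead of every 30 seconds,
    # so a dead member is dropped after about 3 seconds rather than 90.
    bond-lacp-rate fast
```

The matching fast rate also has to be configured on the switch ports. How that is done is vendor-specific and not covered here.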