From: Iztok Gregori
To: Proxmox VE user list
Date: Tue, 7 Jan 2025 12:06:54 +0100
Subject: [PVE-User] Corosync and Cluster reboot

Hi to all!

I need some help to understand a situation (a cluster reboot) which happened to us last week.

We are running a 17-node Proxmox cluster with a separate Ceph cluster for storage (no hyper-convergence). We had to upgrade a stack of 2 switches, and in order to avoid any downtime we decided to prepare a new (temporary) stack and move the links from one switch to the other. Our procedure was the following (a rough command sketch follows the list):

- Migrate all the VMs off the node.
- Unplug the links from the old switch.
- Plug the links into the temporary switch.
- Wait until the node is available again in the cluster.
- Repeat.
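A minimal sketch of that per-node loop, assuming hypothetical node names and that guests are moved with the standard Proxmox tools (qm, ha-manager and pvecm are the stock CLIs; the node name and the filtering are placeholders):

  #!/bin/bash
  # Sketch of the per-node procedure; node names are hypothetical.
  TARGET=pve-spare        # node that temporarily receives the guests

  # 1. Live-migrate every running VM off this node.
  #    (For HA-managed guests, "ha-manager migrate vm:<vmid> <node>" is the alternative.)
  for vmid in $(qm list | awk 'NR>1 && $3=="running" {print $1}'); do
      qm migrate "$vmid" "$TARGET" --online
  done

  # 2. Unplug the links from the old switch and plug them into the temporary stack.

  # 3. Wait until corosync sees the node again before moving on to the next one.
  until pvecm status | grep -q "Quorate:.*Yes"; do
      sleep 5
  done
  pvecm nodes             # double-check that all 17 nodes are back in the membership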
We had to move 8 nodes from one switch to the other. The first 4 nodes went smoothly, but when we plugged the 5th node into the new switch, ALL the nodes which have HA VMs configured rebooted! From the Corosync logs I can see that the token wasn't received, and because of that watchdog-mux wasn't updated, causing the node reboots.

Here are the Corosync logs during the procedure, up to the point where the nodes restarted. They were captured on a node which didn't reboot (pve-ha-lrm: idle):

> 12:51:57 [KNET ] link: host: 18 link: 0 is down
> 12:51:57 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
> 12:51:57 [KNET ] host: host: 18 has no active links
> 12:52:02 [TOTEM ] Token has not been received in 9562 ms
> 12:52:16 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
> 12:52:16 [QUORUM] Sync left[1]: 18
> 12:52:16 [TOTEM ] A new membership (1.d29) was formed. Members left: 18
> 12:52:16 [TOTEM ] Failed to receive the leave message. failed: 18
> 12:52:16 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
> 12:52:16 [MAIN ] Completed service synchronization, ready to provide service.
> 12:52:42 [KNET ] rx: host: 18 link: 0 is up
> 12:52:42 [KNET ] host: host: 18 (passive) best link: 0 (pri: 1)
> 12:52:50 [TOTEM ] Token has not been received in 9567 ms
> 12:53:01 [TOTEM ] Token has not been received in 20324 ms
> 12:53:11 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
> 12:53:11 [TOTEM ] A new membership (1.d35) was formed. Members
> 12:53:20 [TOTEM ] Token has not been received in 9570 ms
> 12:53:31 [TOTEM ] Token has not been received in 20326 ms
> 12:53:41 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 19
> 12:53:41 [TOTEM ] A new membership (1.d41) was formed. Members
> 12:53:50 [TOTEM ] Token has not been received in 9570 ms

And here are the logs of a successfully completed "procedure":

> 12:19:12 [KNET ] link: host: 19 link: 0 is down
> 12:19:12 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
> 12:19:12 [KNET ] host: host: 19 has no active links
> 12:19:17 [TOTEM ] Token has not been received in 9562 ms
> 12:19:31 [QUORUM] Sync members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
> 12:19:31 [QUORUM] Sync left[1]: 19
> 12:19:31 [TOTEM ] A new membership (1.d21) was formed. Members left: 19
> 12:19:31 [TOTEM ] Failed to receive the leave message. failed: 19
> 12:19:31 [QUORUM] Members[16]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18
> 12:19:31 [MAIN ] Completed service synchronization, ready to provide service.
> 12:19:47 [KNET ] rx: host: 19 link: 0 is up
> 12:19:47 [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
> 12:19:50 [QUORUM] Sync members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
> 12:19:50 [QUORUM] Sync joined[1]: 19
> 12:19:50 [TOTEM ] A new membership (1.d25) was formed. Members joined: 19
> 12:19:51 [QUORUM] Members[17]: 1 2 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19
> 12:19:51 [MAIN ] Completed service synchronization, ready to provide service.

Comparing the two logs I can see that after the "host: 18" link came back up the token was still not received, but I cannot figure out what went differently in this case.
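To put the "Token has not been received" timings in context, here is a minimal sketch of how the effective totem timeouts and the knet link state can be read at runtime (corosync-cmapctl, corosync-quorumtool and corosync-cfgtool are the stock corosync utilities; the scaling formula in the comment is the one documented in corosync.conf(5)):

  #!/bin/bash
  # Effective token timeout in ms. With the defaults it scales with cluster
  # size: token + (nodes - 2) * token_coefficient, so on 17 nodes it is much
  # larger than the configured "token" value.
  corosync-cmapctl -g runtime.config.totem.token
  corosync-cmapctl -g runtime.config.totem.token_retransmits_before_loss_const

  # Current membership and quorum state as corosync sees it.
  corosync-quorumtool -s

  # Per-node knet link status (which links are up or down right now).
  corosync-cfgtool -s

If I read the defaults correctly, that warning is logged at roughly 75% of the effective token timeout, and on a 17-node cluster with the stock token/token_coefficient the timeout works out to about 12750 ms, which matches the 9562 ms seen in both logs; so the timings themselves look like the normal size-scaled defaults rather than a tuned value.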
I have 2 possible culprits:

1. NETWORK

The cluster network is backed by 5 Extreme Networks switches: 3 stacks of two x870 (100GbE), 1 stack of two x770 (40GbE) and one temporary stack of two 7720-32C (100GbE). The switches are linked together by 2x LACP bonds, and 99% of the cluster communication is on 100GbE.

The hosts are connected to the network with interfaces of different speeds: 10GbE (1 node), 25GbE (4 nodes), 40GbE (1 node), 100GbE (11 nodes). All the nodes are bonded, and the Corosync network (which is the same as the management one) is defined on a bridge interface on top of the bonded link. The configuration is almost the same on all nodes; some older ones use balance-xor and the others use LACP as the bonding mode.

It is possible that there is something wrong with the network, but I cannot find a probable cause. From the data that I have, I don't see anything special: no links were saturated, no errors were logged...

2. COROSYNC

The cluster is running an OLD version of Proxmox (7.1-12) with Corosync 3.1.5-pve2. It is possible that there is a problem in Corosync that was fixed in a later release; I did a quick search but I didn't find anything. The cluster upgrade is on my to-do list (but the list is huge, so it will not be done tomorrow).

We are running only one Corosync network, which is the same as the management/migration network but different from the one used for client/storage/backup traffic. The configuration is very basic, I think it is the default one; I can provide it if needed. I checked the Corosync stats and the average latency is around 150 (microseconds?) along all links on all nodes.

====

In general it could be a combination of the 2 above or something completely different. Do you have any advice on where to look to debug this further? I can provide more information if needed.

Thanks a lot!

Iztok

--
Iztok Gregori
ICT Systems and Services
Elettra - Sincrotrone Trieste S.C.p.A.
Telephone: +39 040 3758948
http://www.elettra.eu
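For reference, the latency figure mentioned above can be read from corosync's runtime "stats" map; a minimal sketch, assuming corosync 3.x with knet (the grep pattern and the polling loop are illustrative only, and whether the values are reported in microseconds is exactly the open question, so the comments are not authoritative):

  #!/bin/bash
  # Dump the per-link knet statistics corosync keeps in its "stats" map,
  # e.g. to watch latency and link flaps while cables are being moved.
  corosync-cmapctl -m stats | grep -E 'stats\.knet\.node[0-9]+\.link[0-9]+\.(latency_ave|latency_max|down_count)'

  # Poll while the links are being re-plugged (stop with Ctrl-C).
  while true; do
      date
      corosync-cmapctl -m stats | grep latency_ave
      sleep 2
  done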