From: Alexandre DERUMIER
To: dietmar
Cc: Proxmox VE development discussion, pve-devel
Date: Mon, 7 Sep 2020 09:19:40 +0200 (CEST)
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>Indeed, this should not happen.
>>Do you use a separate network for corosync?

No, I use a 2x40Gb LACP link.

>>was there high traffic on the network?

No, I'm far from saturating them (in pps or throughput); I'm around 3-4Gbps.

The cluster is 14 nodes, with around 1000 VMs (and HA enabled on all VMs).

From my understanding, watchdog-mux was still running, as the watchdog reset only after 1 min and not 10 s,
so it looks like the LRM was blocked and not sending the watchdog timer reset to watchdog-mux.

I'll do tests with softdog + soft_noboot=1, so if that happens again I'll be able to debug (see the softdog sketch at the end of this mail).

>>What kind of maintenance was the reason for the shutdown?

A RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
(I had just shut down the server, and had not started it again yet when the problem occurred.)

>>Do you use the default corosync timeout values, or do you have a special setup?

No special tuning, default values. (I haven't had any retransmits in the logs for months.)

>>Can you please post the full corosync config?

(I have verified: the running version was corosync 3.0.3 with libknet 1.15.)

Here is the config:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: m6kvm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: m6kvm1
  }
  node {
    name: m6kvm10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: m6kvm10
  }
  node {
    name: m6kvm11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: m6kvm11
  }
  node {
    name: m6kvm12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: m6kvm12
  }
  node {
    name: m6kvm13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: m6kvm13
  }
  node {
    name: m6kvm14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: m6kvm14
  }
  node {
    name: m6kvm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: m6kvm2
  }
  node {
    name: m6kvm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: m6kvm3
  }
  node {
    name: m6kvm4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: m6kvm4
  }
  node {
    name: m6kvm5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: m6kvm5
  }
  node {
    name: m6kvm6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: m6kvm6
  }
  node {
    name: m6kvm7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: m6kvm7
  }
  node {
    name: m6kvm8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: m6kvm8
  }
  node {
    name: m6kvm9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: m6kvm9
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: m6kvm
  config_version: 19
  interface {
    bindnetaddr: 10.3.94.89
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: knet
  version: 2
}

----- Original Message -----
From: "dietmar"
To: "aderumier", "Proxmox VE development discussion"
Cc: "pve-devel"
Sent: Sunday, 6 September 2020 14:14:06
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)

Indeed, this should not happen. Do you use a separate network for corosync? Or
was there high traffic on the network? What kind of maintenance was the reason
for the shutdown?
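
PS: for the softdog + soft_noboot=1 test mentioned above, a minimal sketch of the module options I have in mind (the modprobe.d file name is only an example; soft_margin and soft_noboot are standard softdog kernel module parameters, and watchdog-mux may still override the timeout via ioctl when it opens /dev/watchdog):

# /etc/modprobe.d/softdog-debug.conf   (example file name)
# soft_noboot=1: on expiry softdog only logs the event instead of rebooting
# the node, so a blocked pve-ha-lrm / watchdog-mux can still be inspected.
# soft_margin: watchdog timeout in seconds.
options softdog soft_margin=60 soft_noboot=1

The module has to be reloaded (or the node rebooted) with the HA services stopped for the new options to take effect.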