From: dietmar
To: Alexandre DERUMIER
Cc: Proxmox VE development discussion
Date: Mon, 7 Sep 2020 10:18:42 +0200 (CEST)
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum:

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...

> On 09/07/2020 9:19 AM Alexandre DERUMIER wrote:
>
> >> Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40Gb LACP link.
>
> >> Was there high traffic on the network?
>
> No, I'm far from saturating it (in pps or throughput); I'm around 3-4 Gbps.
>
> The cluster is 14 nodes, with around 1000 VMs (HA enabled on all of them).
>
> From my understanding, watchdog-mux was still running, since the watchdog reset only after 1 minute and not after 10 seconds;
> so it looks like the lrm was blocked and no longer sending watchdog timer resets to watchdog-mux.
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.
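
(Side note: the in-kernel softdog module has a soft_noboot parameter; with soft_noboot=1 an expired watchdog is only logged instead of triggering a reboot, which is what makes such a test debuggable. A minimal sketch of setting it persistently; the file name here is just an illustration:

    # /etc/modprobe.d/softdog.conf  (file name is illustrative)
    # soft_noboot=1: on watchdog expiry, log the event instead of rebooting
    options softdog soft_noboot=1

For a one-off test the parameter can also be passed directly: modprobe softdog soft_noboot=1)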
>
> >> What kind of maintenance was the reason for the shutdown?
>
> A RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
> (I had just shut the server down and had not started it again yet when the problem occurred.)
>
> >> Do you use the default corosync timeout values, or do you have a special setup?
>
> No special tuning, default values. (I haven't seen any retransmits in the logs for months.)
>
> >> Can you please post the full corosync config?
>
> (I have verified: the running version was corosync 3.0.3 with libknet 1.15.)
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
>
> ----- Original Message -----
> From: "dietmar"
> To: "aderumier", "Proxmox VE development discussion"
> Cc: "pve-devel"
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?
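
Regarding the default timeout values discussed above, for reference: an explicit token timeout would be set in the totem section of corosync.conf. The value below is purely an illustration of the syntax, not a tuning recommendation:

    totem {
      # token timeout in milliseconds; corosync's default is 1000, and the
      # runtime timeout also grows with the node count via token_coefficient
      token: 10000
    }

After such a change, config_version needs to be bumped and the config reloaded on all nodes (e.g. with corosync-cfgtool -R).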