From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 7 Sep 2020 11:32:13 +0200 (CEST)
From: Alexandre DERUMIER
To: dietmar
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <1066029576.414316.1599471133463.JavaMail.zimbra@odiso.com>
In-Reply-To: <72727125.827.1599466723564@webmail.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>>
>>No HA involved...
I had already helped this user a few weeks ago:
https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093

HA was active at that time. (Maybe the watchdog was still running; I'm not sure whether the LRM disarms the watchdog when you disable HA on all VMs?)

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "Proxmox VE development discussion"
Sent: Monday, September 7, 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum:
https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...

> On 09/07/2020 9:19 AM Alexandre DERUMIER wrote:
>
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40Gb LACP link.
>
> >>was there high traffic on the network?
>
> But I'm far from saturating it (in pps or throughput); I'm around 3-4 Gbps.
>
>
> The cluster is 14 nodes, with around 1000 VMs (HA enabled on all of them).
>
>
> From my understanding, watchdog-mux was still running, as the watchdog only reset after 1 minute and not after 10 seconds,
> so it looks like the LRM was blocked and had stopped sending watchdog timer resets to watchdog-mux.
>
>
> I'll run tests with softdog + soft_noboot=1, so if it happens again I'll be able to debug.
>
>
>
> >>What kind of maintenance was the reason for the shutdown?
>
> A RAM upgrade. (The server was running fine before the shutdown; no hardware problem.)
> (I had just shut the server down and had not started it again yet when the problem occurred.)
>
>
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
>
> No special tuning, default values.
> (I haven't had any retransmits in the logs for months.)
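>
> (For reference, a quick sketch of how I double-checked the running values with the stock corosync 3 tools; the grep pattern is just illustrative:)
>
> # print the totem token timeout actually in effect at runtime
> corosync-cmapctl | grep -i totem.token
> # print per-node knet link status
> corosync-cfgtool -s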
>
> >>Can you please post the full corosync config?
>
> (I have verified: the running version was corosync 3.0.3 with libknet 1.15.)
>
>
> Here is the config:
>
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
>
>
> ----- Original Message -----
> From: "dietmar"
> To: "aderumier", "Proxmox VE development discussion"
> Cc: "pve-devel"
> Sent: Sunday, September 6, 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?
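
PS: since the question of a separate network for corosync keeps coming up, here is a rough sketch of what a dedicated second knet link could look like (corosync 3 supports multiple links; the link priorities and the placeholder addresses below are purely illustrative, adapt them to your own setup):

totem {
    # keep the existing totem options, add one interface block per link
    interface {
        linknumber: 0
        knet_link_priority: 10   # preferred link: dedicated corosync network
    }
    interface {
        linknumber: 1
        knet_link_priority: 5    # fallback link: the shared LACP bond
    }
}

nodelist {
    node {
        name: m6kvm1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: <m6kvm1 address on the dedicated network>
        ring1_addr: m6kvm1       # existing address on the shared network
    }
    # ... same pattern (ring0_addr + ring1_addr) for every node
}

In knet's default passive mode the highest-priority available link carries the traffic, so corosync would fall back to the shared bond only if the dedicated network fails. Don't forget to bump config_version when changing the file.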