Date: Mon, 7 Sep 2020 15:23:26 +0200 (CEST)
From: Alexandre DERUMIER
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Message-ID: <872332597.423950.1599485006085.JavaMail.zimbra@odiso.com>
In-Reply-To: <1066029576.414316.1599471133463.JavaMail.zimbra@odiso.com>

Looking at these logs:

Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock' - cfs lock update failed - Permission denied
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock' - cfs lock update failed - Permission denied

in PVE/HA/Env/PVE2.pm:

"
my $ctime = time();
my $last_lock_time = $last->{lock_time} // 0;
my $last_got_lock = $last->{got_lock};

my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

eval {
    mkdir $lockdir;

    # pve cluster filesystem not online
    die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

    if (($ctime - $last_lock_time) < $retry_timeout) {
        # try cfs lock update request (utime)
        if (utime(0, $ctime, $filename)) {
            $got_lock = 1;
            return;
        }
        die "cfs lock update failed - $!\n";
    }
"

If $retry_timeout is 120, could that explain why I don't have any logs on the other nodes, given that the watchdog triggered after 60s?

I don't know much about how locks work in pmxcfs, but when a corosync member leaves or joins and a new cluster membership is formed, could some locks be lost or hang?
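To make the timing question concrete, here is a minimal sketch of the sequence I have in mind (my own numbers and event wording, not PVE code; the 60s value is only the reset delay observed on m6kvm7):

"
#!/usr/bin/perl
use strict;
use warnings;

# Timeline sketch: $retry_timeout (120s, from the snippet above) vs the
# ~60s watchdog expiry observed on m6kvm7.
my $lock_lifetime   = 120;  # pmxcfs hardcoded lock lifetime ($retry_timeout)
my $watchdog_expiry = 60;   # observed delay before the node reset

for my $t (0, $watchdog_expiry, $lock_lifetime) {
    my @events;
    push @events, "watchdog fires, blocked node resets"
        if $t == $watchdog_expiry;
    push @events, "lock older than retry_timeout, another node could take it"
        if $t == $lock_lifetime;
    printf "t=%3ds: %s\n", $t,
        (join(', ', @events) || 'lock still valid, other nodes log nothing');
}
"

If that reading is right, the node is fenced a full minute before the lock would even look stale on the surviving nodes, which would explain why they logged nothing.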
----- Original Message -----
From: "aderumier"
To: "dietmar"
Cc: "Proxmox VE development discussion"
Sent: Monday, 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>>
>>No HA involved...

I had already helped this user some weeks ago:
https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093

HA was active at that time. (Maybe the watchdog was still running; I'm not sure whether the LRM disables the watchdog if you disable HA on all VMs?)

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "Proxmox VE development discussion"
Sent: Monday, 7 September 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum:
https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...

> On 09/07/2020 9:19 AM Alexandre DERUMIER wrote:
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40GB LACP link.
>
> >>was there high traffic on the network?
>
> I'm far from saturating it (in pps or throughput); I'm around 3-4gbps.
>
> The cluster is 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
> From my understanding, watchdog-mux was still running, as the watchdog reset only after 1min and not 10s,
> so it looks like the LRM was blocked and not sending the watchdog timer reset to watchdog-mux.
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again I'll be able to debug.
>
> >>What kind of maintenance was the reason for the shutdown?
>
> A RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
> (I had just shut down the server, and hadn't started it again yet when the problem occurred.)
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
> No special tuning, default values.
> (I haven't had any retransmits in the logs for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified: the running corosync version was 3.0.3, with libknet 1.15.)
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
> "
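As a side note on the nodelist above (my own arithmetic, not corosync output): with corosync_votequorum and 14 nodes at one vote each, a single clean shutdown shouldn't cost quorum by itself:

"
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Quorum arithmetic for the config above: corosync_votequorum needs
# floor(expected_votes / 2) + 1 votes to stay quorate.
my $expected_votes = 14;  # 14 nodes, quorum_votes: 1 each
my $quorum         = floor($expected_votes / 2) + 1;

printf "expected_votes=%d quorum=%d after_one_node_shutdown=%d\n",
    $expected_votes, $quorum, $expected_votes - 1;
"

13 remaining votes are well above the threshold of 8, so if I compute that right, the breakage has to come from the membership change or lock handling, not from plain vote counting.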
> ----- Original Message -----
> From: "dietmar"
> To: "aderumier", "Proxmox VE development discussion"
> Cc: "pve-devel"
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel