From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 10 Sep 2020 06:58:12 +0200 (CEST)
From: Alexandre DERUMIER
To: Thomas Lamprecht
Cc: Proxmox VE development discussion, dietmar
Message-ID: <761694744.496919.1599713892772.JavaMail.zimbra@odiso.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thanks Thomas for the investigations.

I'm still trying to reproduce...

I think I have some special case here, because the forum user with 30 nodes had a corosync cluster split.
(Note that I had this bug 6 months ago, when shutting down a node too, and the only way to recover was to fully stop corosync on all nodes and start corosync again on all nodes.)

But this time the corosync logs look fine (every node correctly sees node2 go down, and sees the remaining nodes).

The surviving node7 was the only node with HA, and the LRM didn't have the watchdog enabled (I haven't found any log like "pve-ha-lrm: watchdog active" for the last 6 months on this node).

So, the timing was:

10:39:05 : the "halt" command is sent to node2

10:39:16 : node2 leaves corosync / halts -> every node sees it and correctly forms a new membership with the 13 remaining nodes

I don't see any special logs (corosync, pmxcfs, pve-ha-crm, pve-ha-lrm) after node2 leaves. There is still activity on the servers: pve-firewall is still logging, and the VMs are running fine.

between 10:40:25 - 10:40:34 : the watchdog resets the nodes, but not node7.
-> so roughly 70-80s after node2 went down; I think watchdog-mux was still running fine until then.
(That sounds like the LRM was stuck and client_watchdog_timeout had expired in watchdog-mux.)

10:40:41 : node7 loses quorum (as all the other nodes have reset)

10:40:50 : node7 crm/lrm finally log:

Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied

So I really think that something stuck the lrm/crm loop, and the watchdog was not reset (kept alive) because of that.

----- Original Mail -----
From: "Thomas Lamprecht"
To: "Proxmox VE development discussion", "aderumier", "dietmar"
Sent: Wednesday, 9 September 2020 22:05:49
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 08.09.20 09:11, Alexandre DERUMIER wrote:
>>> It would really help if we can reproduce the bug somehow. Do you have an idea how
>>> to trigger the bug?
>
> I really don't know. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting a node.
>
> Maybe it's related to the number of VMs, or the number of nodes; I don't have any clue ...

I checked the watchdog code a bit, our user-space mux one and the kernel drivers, and am just noting a few things here (thinking out loud):

The /dev/watchdog itself is always active, else we could lose it to some other program and not be able to activate HA dynamically.
But, as long as no HA service got active, it's a simple dummy "wake up every second and do an ioctl keep-alive update".
This is really simple and efficiently written, so if that fails for over 10s the system is really loaded, probably barely responding to anything.
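For illustration, a minimal sketch of what such a per-second keep-alive loop looks like against the generic Linux watchdog API. This is not the actual watchdog-mux source; the device path, the 10s timeout and the error handling are assumptions here.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    /* Opening /dev/watchdog arms the timer; if no keep-alive update
     * arrives within the timeout, the kernel resets the machine. */
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 10;                       /* assumed 10s timeout */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* only if the driver supports it */

    for (;;) {
        /* The keep-alive update: if this loop is not scheduled for
         * longer than the timeout, the watchdog fires. */
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) == -1)
            perror("WDIOC_KEEPALIVE");
        sleep(1);
    }
}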
Currently the watchdog-mux runs as a normal process, no re-nice, no real-time scheduling. This is IMO wrong, as it is a critical process which needs to be run with high priority. I've a patch here which sets it to the highest SCHED_RR realtime-scheduling priority available, effectively the same as what corosync does.

diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 818ae00..71981d7 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -8,2 +8,3 @@
 #include <...>
+#include <sched.h>
 #include <...>
@@ -151,2 +177,15 @@ main(void)
+    int sched_priority = sched_get_priority_max(SCHED_RR);
+    if (sched_priority != -1) {
+        struct sched_param global_sched_param;
+        global_sched_param.sched_priority = sched_priority;
+        int res = sched_setscheduler(0, SCHED_RR, &global_sched_param);
+        if (res == -1) {
+            fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority);
+        } else {
+            fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority);
+        }
+    }
+
+
     if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) {

The issue with no HA but a watchdog reset due to a massively overloaded system should already be largely avoided by the scheduling change alone.

Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active. This *could* come from a side effect like ceph rebalancing kicking off and producing a load spike for >10s, hindering the scheduling of the watchdog-mux. This is a theory, but with HA off it needs to be something like that, as in the HA-off case there is *no* direct or indirect connection between corosync/pmxcfs and the watchdog-mux. It simply does not care about, or notice, quorum partition changes at all.

There may be an approach to reserve the watchdog for the mux, but avoid having it as a "ticking time bomb":
Theoretically one could open it, then disable it with an ioctl (it can be queried whether a driver supports that) and only enable it for real once the first client connects to the MUX. This may not work for all watchdog modules, and if so, we may want to make it configurable, as some people actually want a reset if a (future) real-time process cannot be scheduled for >= 10 seconds.

With HA active, well, then there could be something off, either in corosync/knet or in how we interface with it in pmxcfs. That could well be, but it won't explain the non-HA issues.

Speaking of pmxcfs, that one also runs with standard priority; we may want to change that too to an RT scheduler, so that it is ensured it can process all corosync events.

I also have a few other small watchdog-mux patches around: it should nowadays actually be able to tell us why a reset happened (can also be over/under voltage, temperature, ...), and I'll repeat the ioctl for keep-alive a few times if it fails, we can only win with that after all.
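For reference, a rough sketch of what those last ideas could look like against the generic Linux watchdog ioctls: disabling/enabling the card, reading the boot status to learn why a reset happened, and retrying the keep-alive ioctl. The ioctl and flag names come from linux/watchdog.h; this is only an illustration, the actual watchdog-mux patches may differ.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Enable or disable the card; not every driver supports this (the
 * ioctl then fails), which is the caveat mentioned above. */
static int set_card_enabled(int fd, int enable)
{
    int flags = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;
    return ioctl(fd, WDIOC_SETOPTIONS, &flags);
}

/* Report why the last reset happened, as far as the driver knows. */
static void print_boot_status(int fd)
{
    int flags = 0;
    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) == -1) {
        perror("WDIOC_GETBOOTSTATUS");
        return;
    }
    if (flags & WDIOF_CARDRESET)
        fprintf(stderr, "last reset: watchdog timeout\n");
    if (flags & WDIOF_OVERHEAT)
        fprintf(stderr, "last reset: over temperature\n");
    if (flags & WDIOF_POWERUNDER)
        fprintf(stderr, "last reset: under voltage\n");
    if (flags & WDIOF_POWEROVER)
        fprintf(stderr, "last reset: over voltage\n");
    if (flags == 0)
        fprintf(stderr, "last reset: normal boot (or driver cannot tell)\n");
}

/* Retry the keep-alive a few times so a single transient failure
 * does not cost us the update. */
static int keepalive_retry(int fd, int tries)
{
    for (int i = 0; i < tries; i++) {
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) == 0)
            return 0;
    }
    return -1;
}

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("open /dev/watchdog");
        return 1;
    }
    print_boot_status(fd);
    set_card_enabled(fd, 0);   /* keep it reserved but inactive ...       */
    set_card_enabled(fd, 1);   /* ... until the first client connects    */
    keepalive_retry(fd, 3);
    /* Note: closing without the magic 'V' write normally does not stop
     * the watchdog, so a real daemon keeps the fd open and updating. */
    return 0;
}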