From: Alexandre DERUMIER
To: Proxmox VE development discussion
Date: Fri, 25 Sep 2020 11:46:46 +0200 (CEST)
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Message-ID: <1157671072.1253096.1601027205997.JavaMail.zimbra@odiso.com>
In-Reply-To: <1601024991.2yoxd1np1v.astroid@nora.none>

>>I will rebuild once more modifying the send code a bit to log a lot more
>>details when sending state messages, it would be great if you could
>>repeat with that as we are still unable to reproduce the issue.

OK, no problem. I'm able to reproduce it easily, so I'll run a new test when you send the new version.
(And thanks again for debugging this, because it's really beyond my competence.)

----- Original message -----
From: "Fabian Grünbichler"
To: "Proxmox VE development discussion"
Sent: Friday, 25 September 2020 11:19:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
>
> Another hang, this time on corosync stop, coredump available
>
> http://odisoweb1.odiso.net/test3/
>
>
> node1
> ----
> stop corosync : 09:03:10
>
> node2: /etc/pve locked
> ------
> Current time : 09:03:10

thanks, these all indicate the same symptoms:

1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts sync process
3. all (online) nodes receive sync request for dcdb and status
4. all nodes send state for dcdb and status via CPG
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3)

in step 5, there is no trace of the message on the receiving side, even
though the sending node does not log an error. as before, the hang is
just a side-effect of the state machine ending up in a state that should
be short-lived (syncing, waiting for state from all nodes) with no
progress. the code and theory say that this should not happen, as either
sending the state fails, triggering the node to leave the CPG (restarting
the sync), or a node drops out of quorum (triggering a config change,
which triggers restarting the sync), or we get all states from all nodes
and the sync proceeds.
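
The three expected ways out of the syncing state described above can be sketched as a toy model. This is an illustration only, written in Python for brevity; it is not the pmxcfs code, and SyncTracker, Mode and every method name are hypothetical. It assumes each node simply tracks which members' states have arrived in the current round.

```python
from enum import Enum, auto


class Mode(Enum):
    SYNCING = auto()  # waiting for state from every member
    SYNCED = auto()   # all states received, the sync can proceed


class SyncTracker:
    """Toy model of one sync round (hypothetical, not the pmxcfs code)."""

    def __init__(self, members):
        self.restarts = 0
        self._start(members)

    def _start(self, members):
        # Begin a fresh round: forget every state received so far.
        self.members = set(members)
        self.received = set()
        self.mode = Mode.SYNCING

    def on_send_failure(self):
        # Exit 1: sending our own state failed, so the node leaves the
        # group and the sync is restarted.
        self.restarts += 1
        self._start(self.members)

    def on_config_change(self, members):
        # Exit 2: membership changed (for example a node dropped out of
        # quorum), so the sync is restarted with the new member list.
        self.restarts += 1
        self._start(members)

    def on_state(self, node):
        # Exit 3: once a state has arrived from every member, proceed.
        if node in self.members:
            self.received.add(node)
        if self.received == self.members:
            self.mode = Mode.SYNCED

    def missing(self):
        # Members whose state has not been seen in this round.
        return sorted(self.members - self.received)
```

In this model, feeding in states from nodes 1 and 3 of a three-node group, with no send failure and no membership event, leaves the tracker in SYNCING with node 2 reported by missing(). Nothing in the model can move it forward from there, which is the shape of the hang described in step 5.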
this looks to me like a fundamental
assumption/guarantee does not hold..

I will rebuild once more modifying the send code a bit to log a lot more
details when sending state messages, it would be great if you could
repeat with that as we are still unable to reproduce the issue.
hopefully those logs will then indicate whether this is a corosync/knet
bug, or if the issue is in our state machine code somewhere. so far it
looks more like the former..

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel