Date: Fri, 25 Sep 2020 11:19:04 +0200
From: Fabian Grünbichler
To: Proxmox VE development discussion
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
>
> Another hang, this time on corosync stop, coredump available
>
> http://odisoweb1.odiso.net/test3/
>
> node1
> ----
> stop corosync : 09:03:10
>
> node2: /etc/pve locked
> ------
> Current time : 09:03:10

thanks, these all indicate the same symptoms:

1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts sync process
3. all (online) nodes receive sync request for dcdb and status
4. all nodes send state for dcdb and status via CPG
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3)
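to make the failure mode clearer, this is roughly how I picture the bookkeeping
on the receiving side - struct and function names here are made up for
illustration only, not the actual pmxcfs code:

    /* sketch of the sync bookkeeping (hypothetical names) - the point is
     * that the "syncing" state only ends once *every* member has
     * delivered its state message */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_NODES 32

    struct sync_state {
        bool     syncing;                  /* set on every config change */
        uint32_t members[MAX_NODES];       /* nodeids from the CPG confchg callback */
        size_t   member_count;
        bool     state_received[MAX_NODES];
    };

    /* called from the CPG deliver callback for state messages */
    static void handle_state_msg(struct sync_state *s, uint32_t nodeid)
    {
        for (size_t i = 0; i < s->member_count; i++) {
            if (s->members[i] == nodeid)
                s->state_received[i] = true;
        }

        /* only leave "syncing" once every member has delivered its state -
         * a single missing message (step 5 above) keeps /etc/pve locked */
        for (size_t i = 0; i < s->member_count; i++) {
            if (!s->state_received[i])
                return;
        }
        s->syncing = false;
    }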
in step 5, there is no trace of the message on the receiving side, even
though the sending node does not log an error. as before, the hang is
just a side-effect of the state machine ending up in a state that should
be short-lived (syncing, waiting for state from all nodes) with no
progress. the code and theory say that this should not happen: either
sending the state fails, triggering the node to leave the CPG (restarting
the sync), or a node drops out of quorum (triggering a config change,
which in turn restarts the sync), or we get all states from all nodes
and the sync proceeds. this looks to me like a fundamental
assumption/guarantee does not hold..

I will rebuild once more, modifying the send code a bit to log a lot more
details when sending state messages. it would be great if you could
repeat the test with that build, as we are still unable to reproduce the
issue here. hopefully those logs will then indicate whether this is a
corosync/knet bug, or whether the issue is in our state machine code
somewhere. so far it looks more like the former..
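for reference, the extra logging I have in mind is something along these
lines around the send path - the wrapper, retry loop and syslog call are
just a sketch, only cpg_mcast_joined() is the real corosync API used to
send the state messages:

    /* hypothetical wrapper around the state-message send, logging the
     * outcome of every attempt so we can tell whether the message ever
     * left the sending node at all */
    #include <corosync/cpg.h>
    #include <sys/uio.h>
    #include <syslog.h>
    #include <unistd.h>

    static cs_error_t send_state_logged(cpg_handle_t handle,
                                        const void *data, size_t len)
    {
        struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
        cs_error_t res;
        int retries = 0;

        do {
            res = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
            if (res == CS_ERR_TRY_AGAIN) {
                retries++;
                usleep(10000);
            }
        } while (res == CS_ERR_TRY_AGAIN && retries < 100);

        /* log every send attempt's result, not just hard failures */
        syslog(LOG_INFO, "sent state message: len=%zu result=%d retries=%d",
               len, (int)res, retries);

        return res;
    }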