Date: Fri, 25 Sep 2020 11:19:04 +0200
From: Fabian Grünbichler
To: Proxmox VE development discussion
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
>
> Another hang, this time on corosync stop, coredump available
>
> http://odisoweb1.odiso.net/test3/
>
> node1
> ----
> stop corosync : 09:03:10
>
> node2: /etc/pve locked
> ------
> Current time : 09:03:10

thanks, these all indicate the same symptoms:

1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts sync process
3. all (online) nodes receive sync request for dcdb and status
4. all nodes send state for dcdb and status via CPG
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3)
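to make the failure mode clearer, this is roughly how I picture the bookkeeping
on the receiving side - struct and function names here are made up for
illustration only, not the actual pmxcfs code:

    /* sketch of the sync bookkeeping (hypothetical names) - the point is
     * that the "syncing" state only ends once *every* member has
     * delivered its state message */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_NODES 32

    struct sync_state {
        bool     syncing;                  /* set on every config change */
        uint32_t members[MAX_NODES];       /* nodeids from the CPG confchg callback */
        size_t   member_count;
        bool     state_received[MAX_NODES];
    };

    /* called from the CPG deliver callback for state messages */
    static void handle_state_msg(struct sync_state *s, uint32_t nodeid)
    {
        for (size_t i = 0; i < s->member_count; i++) {
            if (s->members[i] == nodeid)
                s->state_received[i] = true;
        }

        /* only leave "syncing" once every member has delivered its state -
         * a single missing message (step 5 above) keeps /etc/pve locked */
        for (size_t i = 0; i < s->member_count; i++) {
            if (!s->state_received[i])
                return;
        }
        s->syncing = false;
    }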
in step 5, there is no trace of the message on the receiving side, even
though the sending node does not log an error. as before, the hang is
just a side-effect of the state machine ending up in a state that should
be short-lived (syncing, waiting for state from all nodes) with no
progress. the code and theory say that this should not happen: either
sending the state fails, triggering the node to leave the CPG (restarting
the sync), or a node drops out of quorum (triggering a config change,
which in turn restarts the sync), or we get all states from all nodes
and the sync proceeds. this looks to me like a fundamental
assumption/guarantee does not hold..

I will rebuild once more, modifying the send code a bit to log a lot more
details when sending state messages. it would be great if you could
repeat the test with that build, as we are still unable to reproduce the
issue here. hopefully those logs will then indicate whether this is a
corosync/knet bug, or whether the issue is in our state machine code
somewhere. so far it looks more like the former..
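for reference, the extra logging I have in mind is something along these
lines around the send path - the wrapper, retry loop and syslog call are
just a sketch, only cpg_mcast_joined() is the real corosync API used to
send the state messages:

    /* hypothetical wrapper around the state-message send, logging the
     * outcome of every attempt so we can tell whether the message ever
     * left the sending node at all */
    #include <corosync/cpg.h>
    #include <sys/uio.h>
    #include <syslog.h>
    #include <unistd.h>

    static cs_error_t send_state_logged(cpg_handle_t handle,
                                        const void *data, size_t len)
    {
        struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
        cs_error_t res;
        int retries = 0;

        do {
            res = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
            if (res == CS_ERR_TRY_AGAIN) {
                retries++;
                usleep(10000);
            }
        } while (res == CS_ERR_TRY_AGAIN && retries < 100);

        /* log every send attempt's result, not just hard failures */
        syslog(LOG_INFO, "sent state message: len=%zu result=%d retries=%d",
               len, (int)res, retries);

        return res;
    }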