From: Thomas Lamprecht
To: Alexandre DERUMIER, Proxmox VE development discussion
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Date: Tue, 15 Sep 2020 11:46:51 +0200
Message-ID: <98e79e8d-9001-db77-c032-bdfcdb3698a6@proxmox.com>
In-Reply-To: <1464606394.823230.1600162557186.JavaMail.zimbra@odiso.com>

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote:
> Hi,
>
> I have finally reproduce it !
>
> But this is with a corosync restart in cron each 1 minute, on node1
>
> Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes.
>
> here the logs with full corosync debug at the time of last corosync restart.
>
> node1 (where corosync is restarted each minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e
>
> node2
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67
>
> node5
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273
>
> I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before.

Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)
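
For reference, the 10:46:21 estimate is just the log timestamp minus the reported loop duration. A minimal Python sketch of that arithmetic follows. It assumes the message is written at the end of the slow loop iteration, takes the year from the mail date (syslog lines carry none), and relies on an English locale for the month name. The helper name `stuck_since` is made up for illustration and is not part of pve-ha-manager.

```python
import re
from datetime import datetime, timedelta

# Log line quoted above from node m6kvm2.
LOG_LINE = "Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)"

# Hypothetical helper: syslog timestamp minus the reported duration.
def stuck_since(line: str, year: int = 2020) -> datetime:
    """Return the approximate time the slow loop iteration started."""
    m = re.match(
        r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*loop take too long \((\d+) seconds\)",
        line,
    )
    if m is None:
        raise ValueError("not a 'loop take too long' line")
    # Assumption: the year is not in the log line, so it is supplied by the caller.
    logged_at = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return logged_at - timedelta(seconds=int(m.group(2)))

print(stuck_since(LOG_LINE).strftime("%H:%M:%S"))  # 10:46:21
```

So the logs worth comparing across nodes are the ones from shortly before 10:46:21, not only those around 10:47:26.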