Date: Mon, 7 Sep 2020 10:18:42 +0200 (CEST)
From: dietmar <dietmar@proxmox.com>
To: Alexandre DERUMIER <aderumier@odiso.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <72727125.827.1599466723564@webmail.proxmox.com>
In-Reply-To: <1661182651.406890.1599463180810.JavaMail.zimbra@odiso.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <1667839988.383835.1599312761359.JavaMail.zimbra@odiso.com>
 <665305060.757.1599319409105@webmail.proxmox.com>
 <1710924670.385348.1599327014568.JavaMail.zimbra@odiso.com>
 <469910091.758.1599366116137@webmail.proxmox.com>
 <570223166.391607.1599370570342.JavaMail.zimbra@odiso.com>
 <1059698258.392627.1599381816979.JavaMail.zimbra@odiso.com>
 <508219973.775.1599394447359@webmail.proxmox.com>
 <1661182651.406890.1599463180810.JavaMail.zimbra@odiso.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
 shutdown

There is a similar report in the forum:

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...


> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote:
>
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x 40Gb LACP link.
>
> >>was there high traffic on the network?
>
> But I'm far from saturating them (in pps or throughput); I'm around 3-4 Gbps.
>
>
> The cluster is 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
>
> From my understanding, watchdog-mux was still running, since the watchdog reset only happened after 1 minute and not after 10 seconds,
> so it looks like the LRM was blocked and no longer sending its watchdog timer resets to watchdog-mux.
>
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.
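
(Side note on that test: soft_noboot is the stock softdog module parameter for this; it makes the watchdog log the timeout instead of rebooting. A minimal sketch, assuming softdog is configured via modprobe and using a hypothetical file name:

    # /etc/modprobe.d/softdog-test.conf  (hypothetical file name)
    # soft_noboot=1: on timeout, softdog only logs instead of rebooting the node
    options softdog soft_noboot=1

The parameter only takes effect when the module is loaded, so reload softdog or reboot the node after adding it.)
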
>
>
>
> >>What kind of maintenance was the reason for the shutdown?
>
> RAM upgrade. (The server was running fine before the shutdown; no hardware problem.)
> (I had just shut down the server and hadn't started it again yet when the problem occurred.)
>
>
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
>
> No special tuning, default values. (I haven't seen any retransmits in the logs for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified that the running version was corosync 3.0.3 with libknet 1.15.)
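
(For the record, a quick way to double-check those versions on a node is to query the package database; the package names below are from memory, so adjust them if they differ on your systems:

    # show the installed cluster engine and knet library versions
    dpkg -l corosync libknet1

corosync -v should print the running engine's version string as well.)
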
>
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
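
(On the timeout question: since there is no token setting in totem{}, corosync's built-in defaults apply. If I remember the defaults correctly (token = 1000 ms, token_coefficient = 650 ms), the effective token timeout on a 14-node cluster works out to roughly 1000 + (14 - 2) * 650 = 8800 ms. A cluster with special tuning would carry it explicitly in this section, for example:

    totem {
      # ... existing settings ...
      token: 10000    # illustrative value only, not a recommendation
    }

Please double-check those defaults against the corosync.conf man page before relying on the exact numbers.)
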
>
>
>
> ----- Original Message -----
> From: "dietmar" <dietmar@proxmox.com>
> To: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
> Cc: "pve-devel" <pve-devel@pve.proxmox.com>
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?