Date: Mon, 7 Sep 2020 15:23:26 +0200 (CEST)
From: Alexandre DERUMIER
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Message-ID: <872332597.423950.1599485006085.JavaMail.zimbra@odiso.com>
In-Reply-To: <1066029576.414316.1599471133463.JavaMail.zimbra@odiso.com>

Looking at these logs:

Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock' - cfs lock update failed - Permission denied
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock' - cfs lock update failed - Permission denied

in PVE/HA/Env/PVE2.pm:

"
my $ctime = time();
my $last_lock_time = $last->{lock_time} // 0;
my $last_got_lock = $last->{got_lock};

my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

eval {
    mkdir $lockdir;

    # pve cluster filesystem not online
    die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

    if (($ctime - $last_lock_time) < $retry_timeout) {
        # try cfs lock update request (utime)
        if (utime(0, $ctime, $filename)) {
            $got_lock = 1;
            return;
        }
        die "cfs lock update failed - $!\n";
    }
"

If $retry_timeout is 120, could that explain why I don't have any logs on the other nodes, given that the watchdog triggered after 60s?

I don't know much about how locks work in pmxcfs, but when a corosync member leaves or joins and a new cluster membership is formed, could some locks be lost or hang?
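To make the timing question concrete, here is a minimal sketch of the sequence I have in mind (my own numbers and event wording, not PVE code; the 60s value is only the reset delay observed on m6kvm7):

"
#!/usr/bin/perl
use strict;
use warnings;

# Timeline sketch: $retry_timeout (120s, from the snippet above) vs the
# ~60s watchdog expiry observed on m6kvm7.
my $lock_lifetime   = 120;  # pmxcfs hardcoded lock lifetime ($retry_timeout)
my $watchdog_expiry = 60;   # observed delay before the node reset

for my $t (0, $watchdog_expiry, $lock_lifetime) {
    my @events;
    push @events, "watchdog fires, blocked node resets"
        if $t == $watchdog_expiry;
    push @events, "lock older than retry_timeout, another node could take it"
        if $t == $lock_lifetime;
    printf "t=%3ds: %s\n", $t,
        (join(', ', @events) || 'lock still valid, other nodes log nothing');
}
"

If that reading is right, the node is fenced a full minute before the lock would even look stale on the surviving nodes, which would explain why they logged nothing.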
----- Original Message -----
From: "aderumier"
To: "dietmar"
Cc: "Proxmox VE development discussion"
Sent: Monday, 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>>
>>No HA involved...

I had already helped this user some weeks ago:
https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093

HA was active at that time. (Maybe the watchdog was still running; I'm not sure whether the LRM disables the watchdog if you disable HA on all VMs?)

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "Proxmox VE development discussion"
Sent: Monday, 7 September 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum:
https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...

> On 09/07/2020 9:19 AM Alexandre DERUMIER wrote:
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40GB LACP link.
>
> >>was there high traffic on the network?
>
> I'm far from saturating it (in pps or throughput); I'm around 3-4gbps.
>
> The cluster is 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
> From my understanding, watchdog-mux was still running, as the watchdog reset only after 1min and not 10s,
> so it looks like the LRM was blocked and not sending the watchdog timer reset to watchdog-mux.
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again I'll be able to debug.
>
> >>What kind of maintenance was the reason for the shutdown?
>
> A RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
> (I had just shut down the server, and hadn't started it again yet when the problem occurred.)
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
> No special tuning, default values.
> (I haven't had any retransmits in the logs for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified: the running corosync version was 3.0.3, with libknet 1.15.)
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
> "
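As a side note on the nodelist above (my own arithmetic, not corosync output): with corosync_votequorum and 14 nodes at one vote each, a single clean shutdown shouldn't cost quorum by itself:

"
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Quorum arithmetic for the config above: corosync_votequorum needs
# floor(expected_votes / 2) + 1 votes to stay quorate.
my $expected_votes = 14;  # 14 nodes, quorum_votes: 1 each
my $quorum         = floor($expected_votes / 2) + 1;

printf "expected_votes=%d quorum=%d after_one_node_shutdown=%d\n",
    $expected_votes, $quorum, $expected_votes - 1;
"

13 remaining votes are well above the threshold of 8, so if I compute that right, the breakage has to come from the membership change or lock handling, not from plain vote counting.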
> ----- Original Message -----
> From: "dietmar"
> To: "aderumier", "Proxmox VE development discussion"
> Cc: "pve-devel"
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel