Date: Tue, 15 Sep 2020 16:09:50 +0200 (CEST)
From: Alexandre DERUMIER <aderumier@odiso.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <1798333820.838842.1600178990068.JavaMail.zimbra@odiso.com>
In-Reply-To: <43250fdc-55ba-03d9-2507-a2b08c5945ce@proxmox.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <1746620611.752896.1600159335616.JavaMail.zimbra@odiso.com>
 <1464606394.823230.1600162557186.JavaMail.zimbra@odiso.com>
 <98e79e8d-9001-db77-c032-bdfcdb3698a6@proxmox.com>
 <1282130277.831843.1600164947209.JavaMail.zimbra@odiso.com>
 <1732268946.834480.1600167871823.JavaMail.zimbra@odiso.com>
 <1800811328.836757.1600174194769.JavaMail.zimbra@odiso.com>
 <43250fdc-55ba-03d9-2507-a2b08c5945ce@proxmox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
 shutdown

>>
>>Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>
>># systemctl edit pve-cluster
>>
>>And then add snippet:
>>
>>
>>[Service]
>>CPUSchedulingPolicy=rr
>>CPUSchedulingPriority=99

yes, sure, I'll do it now

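For reference, this is what I understand the resulting drop-in to be (a sketch only; the override.conf path is simply where "systemctl edit" normally writes its snippet):

# systemctl edit pve-cluster
# cat /etc/systemd/system/pve-cluster.service.d/override.conf
[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99
# systemctl restart pve-cluster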

> I'm currently digging the logs
>>Is your simplest/most stable reproducer still a periodic restart of corosync in one node?

yes, a simple "systemctl restart corosync" on one node every minute

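(the full reproducer on that node is just a loop along the lines of:

# while true; do systemctl restart corosync; sleep 60; done

left running in the background)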


After 1 hour, it's still locked.

On the other nodes, I still have pmxcfs logs like:

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log
...


On node1, I just restarted the pve-cluster service with "systemctl restart pve-cluster";
the pmxcfs process was killed, but it could not be started again,
and after that /etc/pve became writable again on the other nodes.

(I haven't rebooted node1 yet, in case you want more tests on pmxcfs.)
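To see whether an old mount is what blocks the restart, something like this should show any leftover pmxcfs process or stale fuse mount on /etc/pve:

# ps aux | grep [p]mxcfs
# mount | grep /etc/pve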



root@m6kvm1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s ago
  Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.

manual "pmxcfs -d"
https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e
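
(that was just pmxcfs started by hand; to keep the debug output on the terminal one would run it non-daemonized, roughly:

# pmxcfs -d -f

assuming the -f/--foreground switch, otherwise the -d output only lands in the journal)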




Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000064)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000063)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 157
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 31
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000064)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 9 inode updates)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 25
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 87
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 6 inode updates)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 33
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000067)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log

----> restart
 2352  [ 15/09/2020 15:52:00 ] systemctl restart pve-cluster


Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration filesystem (-1)

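The "Transport endpoint is not connected" looks like the classic leftover of a stale fuse mount: the old /etc/pve mount from the killed pmxcfs is still registered, so every new instance fails to mount over it. If that's what happens here, cleaning the mount point up by hand before the next start should get it going again, something like:

# fusermount -uz /etc/pve    (or: umount -l /etc/pve)
# systemctl reset-failed pve-cluster
# systemctl start pve-cluster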

Some interesting dmesg entries about "pvesr":

[Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
[Tue Sep 15 14:45:34 2020]       Tainted: P           O      5.4.60-1-pve #1
[Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 15 14:45:34 2020] pvesr           D    0 19038      1 0x00000080
[Tue Sep 15 14:45:34 2020] Call Trace:
[Tue Sep 15 14:45:34 2020]  __schedule+0x2e6/0x6f0
[Tue Sep 15 14:45:34 2020]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[Tue Sep 15 14:45:34 2020]  schedule+0x33/0xa0
[Tue Sep 15 14:45:34 2020]  rwsem_down_write_slowpath+0x2ed/0x4a0
[Tue Sep 15 14:45:34 2020]  down_write+0x3d/0x40
[Tue Sep 15 14:45:34 2020]  filename_create+0x8e/0x180
[Tue Sep 15 14:45:34 2020]  do_mkdirat+0x59/0x110
[Tue Sep 15 14:45:34 2020]  __x64_sys_mkdir+0x1b/0x20
[Tue Sep 15 14:45:34 2020]  do_syscall_64+0x57/0x190
[Tue Sep 15 14:45:34 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
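
(pvesr being stuck in mkdir on the hung /etc/pve mount is expected; if it helps, a dump of every task blocked in D state can be taken next time with:

# echo w > /proc/sysrq-trigger

and then read back from dmesg)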




----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, 15 September 2020 15:00:03
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote:
> Hi,
>
> I have reproduced it again,
>
> now I can't write to /etc/pve/ from any node
>

OK, so this really seems to be an issue in pmxcfs or between corosync and pmxcfs,
not the HA LRM or watchdog mux itself.

Can you try to give pmxcfs real time scheduling, e.g., by doing:

# systemctl edit pve-cluster

And then add snippet:


[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99


And restart pve-cluster

> I have also added some debug logs to pve-ha-lrm, and it was stuck in:
> (but if /etc/pve is locked, this is normal)
>
> if ($fence_request) {
>     $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
>     $self->set_local_status({ state => 'lost_agent_lock'});
> } elsif (!$self->get_protected_ha_agent_lock()) {
>     $self->set_local_status({ state => 'lost_agent_lock'});
> } elsif ($self->{mode} eq 'maintenance') {
>     $self->set_local_status({ state => 'maintenance'});
> }
>
>
> corosync quorum is currently ok
>
> I'm currently digging the logs
Is your simplest/most stable reproducer still a periodic restart of corosync in one node?