Date: Tue, 15 Sep 2020 16:09:50 +0200 (CEST)
From: Alexandre DERUMIER
To: Thomas Lamprecht
Cc: Proxmox VE development discussion
Message-ID: <1798333820.838842.1600178990068.JavaMail.zimbra@odiso.com>
In-Reply-To: <43250fdc-55ba-03d9-2507-a2b08c5945ce@proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>> Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>
>> # systemctl edit pve-cluster
>>
>> And then add snippet:
>>
>> [Service]
>> CPUSchedulingPolicy=rr
>> CPUSchedulingPriority=99

yes, sure, I'll do it now
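For reference, a minimal sketch of what I'm applying (assuming "systemctl edit" puts its drop-in under /etc/systemd/system/pve-cluster.service.d/, and using chrt from util-linux to verify):

mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/override.conf <<'EOF'
[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99
EOF
systemctl daemon-reload
systemctl restart pve-cluster
# verify the policy took effect: should report SCHED_RR, rt priority 99
chrt -p "$(pidof pmxcfs)"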
> I'm currently digging the logs

>> Is your simplest/most stable reproducer still a periodic restart of corosync on one node?

yes, a simple "systemctl restart corosync" on one node each minute (the exact loop is sketched below, after the service status).

After 1 hour, it's still locked.

On the other nodes, I still have pmxcfs logs like:

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log
...

On node1, I just restarted the pve-cluster service with "systemctl restart pve-cluster". The pmxcfs process was killed, but it was not able to start again, and after that /etc/pve became writable again on the other nodes.
(I haven't rebooted node1 yet, in case you want more tests on pmxcfs.)

root@m6kvm1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s ago
  Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
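To be explicit, the reproducer on node1 amounts to nothing more than this loop (a sketch; I just fire the restart once a minute):

# restart corosync on this single node every minute
while true; do
    systemctl restart corosync
    sleep 60
done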
manual "pmxcfs -d" https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize faile= d: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize servic= e Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed:= 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize servic= e Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connectio= n Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connect= ion Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: = 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize servic= e Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (= cluster name m6kvm, version =3D 20) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474= , 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799,= 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronis= ation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/34= 74, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/379= 9, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncron= isation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (= epoch 1/3491/00000064) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request= (epoch 1/3491/00000063) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474,= 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, = 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates fro= m leader Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: que= ue length 23 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to dat= e Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: q= ueue length 157 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - tryin= g to commit (got 4 inode updates) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue= : queue length 31 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed:= 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize faile= d: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize servic= e Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000064)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 9 inode updates)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 25
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 87
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 6 inode updates)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 33
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000067)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date

(Note: in this last cycle the [dcdb] sync apparently never completes; there are no [dcdb] "received all states" / "leader is" lines anymore, and from this point on /etc/pve stayed locked.)

Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log

----> restart (shell history entry 2352, 15/09/2020 15:52:00: systemctl restart pve-cluster)

Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration filesystem (-1)
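The "fuse_mount error: Transport endpoint is not connected" looks like the old FUSE mount on /etc/pve being left behind when pmxcfs was killed, so every new pmxcfs instance fails at mount time until systemd rate-limits the restarts. I guess something like this would clean it up before the next start attempt (untested on node1 so far, just a sketch):

# drop the stale FUSE mount so a fresh pmxcfs can mount /etc/pve again
fusermount -u /etc/pve || umount -l /etc/pve
# clear the "Start request repeated too quickly" failed state
systemctl reset-failed pve-cluster
systemctl start pve-cluster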
some interesting dmesg about "pvesr":

[Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
[Tue Sep 15 14:45:34 2020]       Tainted: P           O      5.4.60-1-pve #1
[Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 15 14:45:34 2020] pvesr           D    0 19038      1 0x00000080
[Tue Sep 15 14:45:34 2020] Call Trace:
[Tue Sep 15 14:45:34 2020]  __schedule+0x2e6/0x6f0
[Tue Sep 15 14:45:34 2020]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[Tue Sep 15 14:45:34 2020]  schedule+0x33/0xa0
[Tue Sep 15 14:45:34 2020]  rwsem_down_write_slowpath+0x2ed/0x4a0
[Tue Sep 15 14:45:34 2020]  down_write+0x3d/0x40
[Tue Sep 15 14:45:34 2020]  filename_create+0x8e/0x180
[Tue Sep 15 14:45:34 2020]  do_mkdirat+0x59/0x110
[Tue Sep 15 14:45:34 2020]  __x64_sys_mkdir+0x1b/0x20
[Tue Sep 15 14:45:34 2020]  do_syscall_64+0x57/0x190
[Tue Sep 15 14:45:34 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

(i.e. pvesr blocked inside a mkdir(), presumably on /etc/pve)

----- Original Message -----
From: "Thomas Lamprecht"
To: "aderumier", "Proxmox VE development discussion"
Sent: Tuesday, 15 September 2020 15:00:03
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote:
> Hi,
>
> I have reproduced it again,
>
> now I can't write to /etc/pve/ from any node
>

OK, so it really seems to be an issue in pmxcfs, or between corosync and pmxcfs, not the HA LRM or watchdog mux itself.

Can you try to give pmxcfs real time scheduling, e.g., by doing:

# systemctl edit pve-cluster

And then add snippet:

[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99

And restart pve-cluster.

> I have also added some debug logs to pve-ha-lrm, and it was stuck in:
> (but if /etc/pve is locked, this is normal)
>
> if ($fence_request) {
>     $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
>     $self->set_local_status({ state => 'lost_agent_lock'});
> } elsif (!$self->get_protected_ha_agent_lock()) {
>     $self->set_local_status({ state => 'lost_agent_lock'});
> } elsif ($self->{mode} eq 'maintenance') {
>     $self->set_local_status({ state => 'maintenance'});
> }
>
>
> corosync quorum is currently ok
>
> I'm currently digging the logs

Is your simplest/most stable reproducer still a periodic restart of corosync on one node?