public inbox for pve-devel@lists.proxmox.com
* [pve-devel] corosync bug: cluster break after 1 node clean shutdown
@ 2020-09-03 14:11 Alexandre DERUMIER
  2020-09-04 12:29 ` Alexandre DERUMIER
                   ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-03 14:11 UTC (permalink / raw)
  To: pve-devel

Hi,

I had a problem this morning with corosync, after cleanly shutting down one node of the cluster.
This is a 14-node cluster; node2 was shut down (with the "halt" command).

HA was active, and all nodes were rebooted.
(I have already seen this problem some months ago, without HA enabled, and it looked like half of the nodes couldn't see the other nodes, as if two separate cluster partitions had formed.)

Some users have reported similar problems when adding a new node
https://forum.proxmox.com/threads/quorum-lost-when-adding-new-7th-node.75197/


Here are the full logs of the nodes; on each node I'm seeing quorum with 13 members, so that part seems to be OK.
I didn't have time to look at the servers live, as HA had already restarted them.
So I don't know if it's a corosync bug or something related to crm/lrm/pmxcfs (but I don't see anything special in the logs).
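
(For reference, roughly what I would have checked live on a node in that state; just a sketch with the standard corosync/PVE tools, I didn't get to run it before the reboots:)

  corosync-quorumtool -s    # quorum state and member list as corosync sees it
  corosync-cfgtool -s       # per-node knet link status
  corosync-cpgtool          # CPG groups and their members (pmxcfs joins its own groups)
  pvecm status              # the PVE view of the same information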

Only node7 survived a little bit longer, and I stopped lrm/crm on it to avoid the reboot.
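
(What I did on node7 was roughly the following; a sketch assuming the usual PVE unit names, stopping the HA services so the watchdog gets closed, like the "watchdog closed (disabled)" line in the node2 log below:)

  systemctl stop pve-ha-lrm    # local resource manager, closes its watchdog connection
  systemctl stop pve-ha-crm    # cluster resource manager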

libknet1: 1.15-pve1
corosync: 3.0.3-pve1


Any ideas before submitting a bug to the corosync mailing list? (I see new libknet && corosync versions on their GitHub; I haven't read the changelogs yet.)
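
(Before filing, it could help to capture more detail on the next occurrence; a rough sketch, assuming the stock corosync/PVE tooling rather than anything from this report:)

  # raise corosync logging first: logging { debug: on } in /etc/pve/corosync.conf
  # (and bump config_version so it propagates), see corosync.conf(5)
  journalctl -u corosync -u pve-cluster -f       # follow corosync + pmxcfs during the node shutdown
  pveversion -v | grep -E 'corosync|libknet'     # record the exact installed versions for the report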




node1
-----
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm1 corosync[3678]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm1 corosync[3678]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm1 corosync[3678]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm1 pmxcfs[3527]: [dcdb] notice: cpg_send_message retried 1 times
Sep  3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: received all states
Sep  3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm1 corosync[3678]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm1 corosync[3678]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm1 corosync[3678]:   [KNET  ] host: host: 2 has no active links

--> reboot

node2 : shutdown log
-----
Sep  3 10:39:05 m6kvm2 smartd[27390]: smartd received signal 15: Terminated
Sep  3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302Y9120CGN.ata.state
Sep  3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302MG120CGN.ata.state
Sep  3 10:39:05 m6kvm2 smartd[27390]: smartd is exiting (exit status 0)
Sep  3 10:39:05 m6kvm2 nrpe[20095]: Caught SIGTERM - shutting down...
Sep  3 10:39:05 m6kvm2 nrpe[20095]: Daemon shutdown
Sep  3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> starting task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam:
Sep  3 10:39:07 m6kvm2 pve-guests[38552]: all VMs and CTs stopped
Sep  3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> end task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: OK
Sep  3 10:39:07 m6kvm2 spiceproxy[3847]: received signal TERM
Sep  3 10:39:07 m6kvm2 spiceproxy[3847]: server closing
Sep  3 10:39:07 m6kvm2 spiceproxy[36572]: worker exit
Sep  3 10:39:07 m6kvm2 spiceproxy[3847]: worker 36572 finished
Sep  3 10:39:07 m6kvm2 spiceproxy[3847]: server stopped
Sep  3 10:39:07 m6kvm2 pvestatd[3786]: received signal TERM
Sep  3 10:39:07 m6kvm2 pvestatd[3786]: server closing
Sep  3 10:39:07 m6kvm2 pvestatd[3786]: server stopped
Sep  3 10:39:07 m6kvm2 pve-ha-lrm[12768]: received signal TERM
Sep  3 10:39:07 m6kvm2 pve-ha-lrm[12768]: got shutdown request with shutdown policy 'conditional'
Sep  3 10:39:07 m6kvm2 pve-ha-lrm[12768]: shutdown LRM, stop all services
Sep  3 10:39:08 m6kvm2 pvefw-logger[36670]: received terminate request (signal)
Sep  3 10:39:08 m6kvm2 pvefw-logger[36670]: stopping pvefw logger
Sep  3 10:39:08 m6kvm2 pve-ha-lrm[12768]: watchdog closed (disabled)
Sep  3 10:39:08 m6kvm2 pve-ha-lrm[12768]: server stopped
Sep  3 10:39:10 m6kvm2 pve-ha-crm[12847]: received signal TERM
Sep  3 10:39:10 m6kvm2 pve-ha-crm[12847]: server received shutdown request
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: received signal TERM
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: server closing
Sep  3 10:39:10 m6kvm2 pveproxy[29731]: worker exit
Sep  3 10:39:10 m6kvm2 pveproxy[30873]: worker exit
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: worker 30873 finished
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: worker 31391 finished
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: worker 29731 finished
Sep  3 10:39:10 m6kvm2 pveproxy[24735]: server stopped
Sep  3 10:39:14 m6kvm2 pve-ha-crm[12847]: server stopped


node3
-----
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm3 corosync[30580]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm3 corosync[30580]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm3 corosync[30580]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: cpg_send_message retried 1 times
Sep  3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: received all states
Sep  3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm3 corosync[30580]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm3 corosync[30580]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm3 corosync[30580]:   [KNET  ] host: host: 2 has no active links


node4
-----
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm4 corosync[4085]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: received all states
Sep  3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: all data is up to date
Sep  3 10:39:21 m6kvm4 corosync[4085]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:21 m6kvm4 corosync[4085]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:21 m6kvm4 corosync[4085]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:20 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log
Sep  3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log

node5
-----
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm5 corosync[41830]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: received all states
Sep  3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm5 corosync[41830]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm5 corosync[41830]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm5 corosync[41830]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log
Sep  3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log





node6
-----
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm6 corosync[36694]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:20 m6kvm6 corosync[36694]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm6 corosync[36694]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm6 corosync[36694]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm6 corosync[36694]:   [KNET  ] host: host: 2 has no active links



node7
-----
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm7 corosync[15467]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: received all states
Sep  3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date
Sep  3 10:39:19 m6kvm7 corosync[15467]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:19 m6kvm7 corosync[15467]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:19 m6kvm7 corosync[15467]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:20 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log
---> here the other nodes reboot almost at the same time
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] link: host: 3 link: 0 is down
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] link: host: 14 link: 0 is down
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] host: host: 3 has no active links
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
Sep  3 10:40:25 m6kvm7 corosync[15467]:   [KNET  ] host: host: 14 has no active links
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] link: host: 13 link: 0 is down
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] link: host: 8 link: 0 is down
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] link: host: 6 link: 0 is down
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] link: host: 10 link: 0 is down
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 13 (passive) best link: 0 (pri: 1)
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 13 has no active links
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 8 has no active links
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 6 has no active links
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Sep  3 10:40:27 m6kvm7 corosync[15467]:   [KNET  ] host: host: 10 has no active links
Sep  3 10:40:28 m6kvm7 corosync[15467]:   [TOTEM ] Token has not been received in 4505 ms 
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] link: host: 11 link: 0 is down
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] link: host: 9 link: 0 is down
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] host: host: 11 has no active links
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Sep  3 10:40:29 m6kvm7 corosync[15467]:   [KNET  ] host: host: 9 has no active links
Sep  3 10:40:30 m6kvm7 corosync[15467]:   [TOTEM ] A processor failed, forming new configuration.
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] link: host: 4 link: 0 is down
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] link: host: 12 link: 0 is down
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] link: host: 1 link: 0 is down
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 4 has no active links
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 12 (passive) best link: 0 (pri: 1)
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 12 has no active links
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep  3 10:40:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 1 has no active links
Sep  3 10:40:34 m6kvm7 corosync[15467]:   [KNET  ] link: host: 5 link: 0 is down
Sep  3 10:40:34 m6kvm7 corosync[15467]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep  3 10:40:34 m6kvm7 corosync[15467]:   [KNET  ] host: host: 5 has no active links
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [TOTEM ] A new membership (7.82f) was formed. Members left: 1 3 4 5 6 8 9 10 11 12 13 14
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6 8 9 10 11 12 13 14
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 12 received
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [QUORUM] Members[1]: 7
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date
Sep  3 10:40:41 m6kvm7 corosync[15467]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: dfsm_deliver_queue: queue length 51
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: node lost quorum
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: members: 7/15892
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: received write while not quorate - trigger resync
Sep  3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: leaving CPG group
Sep  3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown'
Sep  3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] notice: start cluster connection
Sep  3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: cpg_join failed: 14
Sep  3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: can't initialize service
Sep  3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892
Sep  3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date
Sep  3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied
Sep  3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled)
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum
Sep  3 10:43:23 m6kvm7 corosync[15467]:   [KNET  ] rx: host: 4 link: 0 is up
Sep  3 10:43:23 m6kvm7 corosync[15467]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep  3 10:43:24 m6kvm7 corosync[15467]:   [TOTEM ] A new membership (4.834) was formed. Members joined: 4
Sep  3 10:43:24 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:43:24 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 4/3965, 7/15892
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: members: 4/3965, 7/15892
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation
Sep  3 10:43:24 m6kvm7 corosync[15467]:   [QUORUM] Members[2]: 4 7
Sep  3 10:43:24 m6kvm7 corosync[15467]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 4/3965/00000002)
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 4/3965/00000002)
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 7/15892
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 7/15892
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: start sending inode updates
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: sent all (4) updates
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received all states
Sep  3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [KNET  ] rx: host: 3 link: 0 is up
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [TOTEM ] A new membership (3.838) was formed. Members joined: 3
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 3/3613, 4/3965, 7/15892
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: members: 3/3613, 4/3965, 7/15892
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [QUORUM] Members[3]: 3 4 7
Sep  3 10:44:05 m6kvm7 corosync[15467]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 3/3613/00000002)
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 3/3613/00000002)
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 4/3965
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 4/3965, 7/15892
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received all states
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date
Sep  3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: dfsm_deliver_queue: queue length 1
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [KNET  ] rx: host: 1 link: 0 is up
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [TOTEM ] A new membership (1.83c) was formed. Members joined: 1
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [CPG   ] downlist left_list: 0 received
Sep  3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3552, 3/3613, 4/3965, 7/15892
Sep  3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation
Sep  3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3552, 3/3613, 4/3965, 7/15892
Sep  3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [QUORUM] Members[4]: 1 3 4 7
Sep  3 10:44:31 m6kvm7 corosync[15467]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3552/00000002)
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3552/00000002)
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 3/3613
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 3/3613, 4/3965, 7/15892
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received all states
Sep  3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date
Sep  3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log
Sep  3 10:49:18 m6kvm7 pmxcfs[15892]: [status] notice: received log




node8
-----

Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm8 corosync[24361]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: received all states
Sep  3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm8 corosync[24361]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm8 corosync[24361]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm8 corosync[24361]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log
Sep  3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log
Sep  3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log
Sep  3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log


--> reboot



node9
-----
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm9 corosync[22340]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: received all states
Sep  3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: all data is up to date
Sep  3 10:39:21 m6kvm9 corosync[22340]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:21 m6kvm9 corosync[22340]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:21 m6kvm9 corosync[22340]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:20 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log
Sep  3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log


--> reboot



node10
------
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm10 corosync[41458]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: received all states
Sep  3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: all data is up to date
Sep  3 10:39:21 m6kvm10 corosync[41458]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:21 m6kvm10 corosync[41458]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:21 m6kvm10 corosync[41458]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log
Sep  3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log

--> reboot


node11
------
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm11 corosync[12455]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: received all states
Sep  3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm11 corosync[12455]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm11 corosync[12455]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm11 corosync[12455]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log
Sep  3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log
Sep  3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log
Sep  3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log
Sep  3 10:40:20 m6kvm11 pmxcfs[12983]: [status] notice: received log




node12
------
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm12 corosync[43716]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm12 pmxcfs[44214]: [dcdb] notice: cpg_send_message retried 1 times
Sep  3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: received all states
Sep  3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm12 corosync[43716]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm12 corosync[43716]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm12 corosync[43716]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:20 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log
Sep  3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log





node13
------
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm13 corosync[39182]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times
Sep  3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states
Sep  3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date
Sep  3 10:39:19 m6kvm13 corosync[39182]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:19 m6kvm13 corosync[39182]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:19 m6kvm13 corosync[39182]:   [KNET  ] host: host: 2 has no active links
Sep  3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log
Sep  3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log
Sep  3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log
Sep  3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log
--> reboot

node14
------
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [CPG   ] downlist left_list: 1 received
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep  3 10:39:16 m6kvm14 corosync[42413]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep  3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states
Sep  3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date
Sep  3 10:39:20 m6kvm14 corosync[42413]:   [KNET  ] link: host: 2 link: 0 is down
Sep  3 10:39:20 m6kvm14 corosync[42413]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep  3 10:39:20 m6kvm14 corosync[42413]:   [KNET  ] host: host: 2 has no active links

--> reboot



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
@ 2020-09-04 12:29 ` Alexandre DERUMIER
  2020-09-04 15:42   ` Dietmar Maurer
  2020-12-29 14:21   ` Josef Johansson
  2020-09-04 15:46 ` Alexandre DERUMIER
  2020-09-30 15:50 ` Thomas Lamprecht
  2 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-04 12:29 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: pve-devel

BTW,

do you think it would be possible to add an extra, optional layer of safety checks, not related to corosync?

I've been wary of this corosync bug for years, and I still don't use HA. (Or rather, I tried to enable it two months ago, and that gave me a disaster yesterday...)

Something like an extra heartbeat between the node daemons, and a check that we also have quorum according to those heartbeats?
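To make the idea a bit more concrete, here is a minimal, hypothetical sketch of such a side-channel heartbeat check. This is not existing Proxmox or corosync code; the peer list, UDP port, intervals and the simple majority rule are all assumptions made purely for illustration. The point would be that an HA agent could refuse to self-fence unless both corosync and this independent channel agree that quorum was really lost.

#!/usr/bin/env python3
# Hypothetical sketch of a corosync-independent heartbeat/quorum check.
# Assumptions (not part of any existing Proxmox daemon): UDP port 5431,
# a static list of the other nodes' IPs, 1 s beat interval, 10 s timeout.
import socket
import threading
import time

PEERS = ["10.0.0.1", "10.0.0.3", "10.0.0.4"]   # the *other* nodes' IPs (assumption)
PORT = 5431                                     # arbitrary UDP port (assumption)
BEAT_INTERVAL = 1.0                             # seconds between heartbeats
LIVENESS_TIMEOUT = 10.0                         # peer considered dead after this

last_seen = {}                                  # peer IP -> last heartbeat timestamp
lock = threading.Lock()

def sender():
    """Send a heartbeat datagram to every peer once per interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        for peer in PEERS:
            try:
                sock.sendto(b"beat", (peer, PORT))
            except OSError:
                pass                            # unreachable peer: just skip it
        time.sleep(BEAT_INTERVAL)

def receiver():
    """Record the arrival time of heartbeats from peers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    while True:
        _, (addr, _) = sock.recvfrom(16)
        with lock:
            last_seen[addr] = time.monotonic()

def have_independent_quorum():
    """True if, counting ourselves, strictly more than half of the cluster
    has been seen alive on this side channel within the timeout."""
    now = time.monotonic()
    with lock:
        alive = sum(1 for t in last_seen.values() if now - t < LIVENESS_TIMEOUT)
    return (alive + 1) > (len(PEERS) + 1) / 2   # +1 counts the local node

if __name__ == "__main__":
    threading.Thread(target=sender, daemon=True).start()
    threading.Thread(target=receiver, daemon=True).start()
    while True:
        time.sleep(5)
        # An HA agent could require *both* corosync quorum and this check
        # to fail before closing the watchdog / fencing itself.
        print("independent quorum:", have_independent_quorum())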



----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "pve-devel" <pve-devel@pve.proxmox.com>
Sent: Thursday, 3 September 2020 16:11:56
Subject: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi, 

I had a problem this morning with corosync, after shutting down cleanly a node from the cluster. 
This is a 14 nodes cluster, node2 was shutted. (with "halt" command) 

HA was actived, and all nodes have been rebooted. 
(I have already see this problem some months ago, without HA enabled, and it was like half of the nodes didn't see others nodes, like 2 cluster formations) 

Some users have reported similar problems when adding a new node 
https://forum.proxmox.com/threads/quorum-lost-when-adding-new-7th-node.75197/ 


Here the full logs of the nodes, for each node I'm seeing quorum with 13 nodes. (so it's seem to be ok). 
I didn't have time to look live on server, as HA have restarted them. 
So, I don't known if it's a corosync bug or something related to crm,lrm,pmxcs (but I don't see any special logs) 

Only node7 have survived a little bit longer,and I had stop lrm/crm to avoid reboot. 

libknet1: 1.15-pve1 
corosync: 3.0.3-pve1 


Any ideas before submiting bug to corosync mailing ? (I'm seeing new libknet && corosync version on their github, I don't have read the changelog yet) 




node1 
----- 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm1 corosync[3678]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm1 corosync[3678]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm1 corosync[3678]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [dcdb] notice: cpg_send_message retried 1 times 
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 has no active links 

--> reboot 

node2 : shutdown log 
----- 
Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd received signal 15: Terminated 
Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302Y9120CGN.ata.state 
Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302MG120CGN.ata.state 
Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd is exiting (exit status 0) 
Sep 3 10:39:05 m6kvm2 nrpe[20095]: Caught SIGTERM - shutting down... 
Sep 3 10:39:05 m6kvm2 nrpe[20095]: Daemon shutdown 
Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> starting task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: 
Sep 3 10:39:07 m6kvm2 pve-guests[38552]: all VMs and CTs stopped 
Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> end task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: OK 
Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: received signal TERM 
Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server closing 
Sep 3 10:39:07 m6kvm2 spiceproxy[36572]: worker exit 
Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: worker 36572 finished 
Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server stopped 
Sep 3 10:39:07 m6kvm2 pvestatd[3786]: received signal TERM 
Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server closing 
Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server stopped 
Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: received signal TERM 
Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: got shutdown request with shutdown policy 'conditional' 
Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: shutdown LRM, stop all services 
Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: received terminate request (signal) 
Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: stopping pvefw logger 
Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: watchdog closed (disabled) 
Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: server stopped 
Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: received signal TERM 
Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: server received shutdown request 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: received signal TERM 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server closing 
Sep 3 10:39:10 m6kvm2 pveproxy[29731]: worker exit 
Sep 3 10:39:10 m6kvm2 pveproxy[30873]: worker exit 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 30873 finished 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 31391 finished 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 29731 finished 
Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server stopped 
Sep 3 10:39:14 m6kvm2 pve-ha-crm[12847]: server stopped 


node3 
----- 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm3 corosync[30580]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm3 corosync[30580]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm3 corosync[30580]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: cpg_send_message retried 1 times 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 has no active links 


node4 
----- 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm4 corosync[4085]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: all data is up to date 
Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:20 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 

node5 
----- 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm5 corosync[41830]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 





node6 
----- 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm6 corosync[36694]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 has no active links 



node7 
----- 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:20 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
---> here the others nodes reboot almost at the same time 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 3 link: 0 is down 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 14 link: 0 is down 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 has no active links 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 has no active links 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 13 link: 0 is down 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 8 link: 0 is down 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 6 link: 0 is down 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 10 link: 0 is down 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 has no active links 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 has no active links 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 has no active links 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 has no active links 
Sep 3 10:40:28 m6kvm7 corosync[15467]: [TOTEM ] Token has not been received in 4505 ms 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 11 link: 0 is down 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 9 link: 0 is down 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 has no active links 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 has no active links 
Sep 3 10:40:30 m6kvm7 corosync[15467]: [TOTEM ] A processor failed, forming new configuration. 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 4 link: 0 is down 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 12 link: 0 is down 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 1 link: 0 is down 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 has no active links 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 has no active links 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 has no active links 
Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] link: host: 5 link: 0 is down 
Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1) 
Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 has no active links 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] A new membership (7.82f) was formed. Members left: 1 3 4 5 6 8 9 10 11 12 13 14 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6 8 9 10 11 12 13 14 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 12 received 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] This node is within the non-primary component and will NOT provide any services. 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] Members[1]: 7 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
Sep 3 10:40:41 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: dfsm_deliver_queue: queue length 51 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: node lost quorum 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: members: 7/15892 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: received write while not quorate - trigger resync 
Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: leaving CPG group 
Sep 3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown' 
Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] notice: start cluster connection 
Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: cpg_join failed: 14 
Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: can't initialize service 
Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 
Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! 
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied 
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied 
Sep 3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock 
Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock 
Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled) 
Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum 
Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] rx: host: 4 link: 0 is up 
Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) 
Sep 3 10:43:24 m6kvm7 corosync[15467]: [TOTEM ] A new membership (4.834) was formed. Members joined: 4 
Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 4/3965, 7/15892 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: members: 4/3965, 7/15892 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
Sep 3 10:43:24 m6kvm7 corosync[15467]: [QUORUM] Members[2]: 4 7 
Sep 3 10:43:24 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 4/3965/00000002) 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 4/3965/00000002) 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 7/15892 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 7/15892 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: start sending inode updates 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: sent all (4) updates 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] rx: host: 3 link: 0 is up 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [TOTEM ] A new membership (3.838) was formed. Members joined: 3 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 3/3613, 4/3965, 7/15892 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: members: 3/3613, 4/3965, 7/15892 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [QUORUM] Members[3]: 3 4 7 
Sep 3 10:44:05 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 3/3613/00000002) 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 3/3613/00000002) 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 4/3965 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 4/3965, 7/15892 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: dfsm_deliver_queue: queue length 1 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] rx: host: 1 link: 0 is up 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.83c) was formed. Members joined: 1 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 
Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 
Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [QUORUM] Members[4]: 1 3 4 7 
Sep 3 10:44:31 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3552/00000002) 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3552/00000002) 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 3/3613 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 3/3613, 4/3965, 7/15892 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log 
Sep 3 10:49:18 m6kvm7 pmxcfs[15892]: [status] notice: received log 




node8 
----- 

Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm8 corosync[24361]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 


--> reboot 



node9 
----- 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm9 corosync[22340]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: all data is up to date 
Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:20 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 


--> reboot 



node10 
------ 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm10 corosync[41458]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: all data is up to date 
Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log 
Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log 

--> reboot 


node11 
------ 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm11 corosync[12455]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
Sep 3 10:40:20 m6kvm11 pmxcfs[12983]: [status] notice: received log 




node12 
------ 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm12 corosync[43716]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [dcdb] notice: cpg_send_message retried 1 times 
Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:20 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 





node13 
------ 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm13 corosync[39182]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times 
Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date 
Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 has no active links 
Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
--> reboot 

node14 
------ 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002) 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 3 10:39:16 m6kvm14 corosync[42413]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states 
Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date 
Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] link: host: 2 link: 0 is down 
Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 has no active links 

--> reboot 

_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-04 12:29 ` Alexandre DERUMIER
@ 2020-09-04 15:42   ` Dietmar Maurer
  2020-09-05 13:32     ` Alexandre DERUMIER
  2020-12-29 14:21   ` Josef Johansson
  1 sibling, 1 reply; 84+ messages in thread
From: Dietmar Maurer @ 2020-09-04 15:42 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER; +Cc: pve-devel

> do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ?

I would try to find the bug instead.

> I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..)
> 
> Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ?

Was this even related to corosync? What exactly caused the reboot?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
  2020-09-04 12:29 ` Alexandre DERUMIER
@ 2020-09-04 15:46 ` Alexandre DERUMIER
  2020-09-30 15:50 ` Thomas Lamprecht
  2 siblings, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-04 15:46 UTC (permalink / raw)
  To: pve-devel

What I'm not sure about is which libknet version was actually running.

The package was upgraded everywhere to 1.16, but I have noticed that the corosync process is not restarted when libknet is upgraded.

So it's quite possible that corosync was still running the 1.13 libknet. (I need to find a way to determine the last corosync or node restart.)

I think we should force a corosync restart on libknet upgrade (or maybe bump the corosync package version at the same time).
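
(As an illustration only: a minimal C sketch, not part of any PVE package, that checks whether a running corosync still maps an old, deleted libknet; the helper name is made up, and a simple "grep '(deleted)' /proc/$(pidof corosync)/maps" would show the same thing.)

/*
 * check-stale-libknet.c - hypothetical helper, not part of any PVE package.
 * If corosync was not restarted after a libknet upgrade, its memory maps
 * still reference the old library file, which the kernel marks "(deleted)".
 * Build: gcc -Wall -o check-stale-libknet check-stale-libknet.c
 * Usage: ./check-stale-libknet $(pidof corosync)
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <corosync-pid>\n", argv[0]);
        return 2;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);

    FILE *maps = fopen(path, "r");
    if (!maps) {
        perror(path);
        return 2;
    }

    char line[4096];
    int stale = 0;
    while (fgets(line, sizeof(line), maps)) {
        /* a replaced shared object stays mapped under its old, deleted path */
        if (strstr(line, "libknet") && strstr(line, "(deleted)"))
            stale = 1;
    }
    fclose(maps);

    puts(stale ? "corosync still maps a deleted libknet -> restart needed"
               : "libknet mapping looks current");
    return stale;
}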



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-04 15:42   ` Dietmar Maurer
@ 2020-09-05 13:32     ` Alexandre DERUMIER
  2020-09-05 15:23       ` dietmar
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-05 13:32 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion, pve-devel

> Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? 

>
>>Was this even related to corosync? What exactly caused the reboot?

Hi Dietmar,

What I'm 100% sure of is that the watchdog rebooted all the servers. (I have the watchdog traces in IPMI.)

That happened just after shutting down the server.
What is strange is that the corosync logs on all servers show that they correctly saw the node go down, and still saw the other nodes.

So, I really don't know.
Maybe corosync was hanging?


I don't have any other logs from crm/lrm/pmxcfs...

I'm really blind. :/




----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>
Cc: "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Vendredi 4 Septembre 2020 17:42:45
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ? 

I would try to find the bug instead. 

> I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..) 
> 
> Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? 

Was this even related to corosync? What exactly caused the reboot? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-05 13:32     ` Alexandre DERUMIER
@ 2020-09-05 15:23       ` dietmar
  2020-09-05 17:30         ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-05 15:23 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion, pve-devel

> what I'm 100% sure, it that the watchdog have reboot all the servers. (I have watchdog trace in ipmi)

So you are using ipmi hardware watchdog?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-05 15:23       ` dietmar
@ 2020-09-05 17:30         ` Alexandre DERUMIER
  2020-09-06  4:21           ` dietmar
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-05 17:30 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion, pve-devel


>>So you are using ipmi hardware watchdog?

yes, I'm using dell idrac ipmi card watchdog.


----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Samedi 5 Septembre 2020 17:23:28
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> what I'm 100% sure, it that the watchdog have reboot all the servers. (I have watchdog trace in ipmi) 

So you are using ipmi hardware watchdog? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-05 17:30         ` Alexandre DERUMIER
@ 2020-09-06  4:21           ` dietmar
  2020-09-06  5:36             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-06  4:21 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion, pve-devel

> >>So you are using ipmi hardware watchdog?
> 
> yes, I'm using dell idrac ipmi card watchdog

But the pve logs look ok, and there is no indication
that we stopped updating the watchdog. So why did the
watchdog trigger? Maybe an IPMI bug?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06  4:21           ` dietmar
@ 2020-09-06  5:36             ` Alexandre DERUMIER
  2020-09-06  6:33               ` Alexandre DERUMIER
  2020-09-06  8:43               ` Alexandre DERUMIER
  0 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-06  5:36 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion, pve-devel

>>But the pve logs look ok, and there is no indication
>>that we stopped updating the watchdog. So why did the
>>watchdog trigger? Maybe an IPMI bug?

Do you mean an IPMI bug on all 13 servers at the same time?
(I also have 2 Supermicro servers in this cluster, but they use the same IPMI watchdog driver (ipmi_watchdog).)



I had the same kind of bug once (when stopping a server) on another cluster, 6 months ago.
This was without HA, but with a different corosync version, and that time I really saw a quorum split in the corosync logs of the servers.


I'll try to reproduce with a virtual cluster of 14 nodes (I don't have enough hardware).


Could it be a bug in the Proxmox HA code, where the watchdog is not reset by the LRM anymore?

----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Dimanche 6 Septembre 2020 06:21:55
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> >>So you are using ipmi hardware watchdog? 
> 
> yes, I'm using dell idrac ipmi card watchdog 

But the pve logs look ok, and there is no indication 
that we stopped updating the watchdog. So why did the 
watchdog trigger? Maybe an IPMI bug? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06  5:36             ` Alexandre DERUMIER
@ 2020-09-06  6:33               ` Alexandre DERUMIER
  2020-09-06  8:43               ` Alexandre DERUMIER
  1 sibling, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-06  6:33 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: dietmar, pve-devel

Also, I wonder if it could be possible to not use watchdog fencing at all (as an option),

if the cluster only uses shared storage with native disk locking/reservation.

Take Ceph RBD for example: with exclusive-lock, you can't write from 2 clients to the same RBD image,
so HA will not be able to start QEMU on another node.


 

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "dietmar" <dietmar@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Dimanche 6 Septembre 2020 07:36:10
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>But the pve logs look ok, and there is no indication 
>>that we stopped updating the watchdog. So why did the 
>>watchdog trigger? Maybe an IPMI bug? 

do you mean an ipmi bug on all 13 servers at the same time ? 
(I also have 2 supermicro servers in this cluster, but they use same ipmi watchdog driver. (ipmi_watchdog) 



I had same kind of with bug once (when stopping a server), on another cluster, 6 months ago. 
This was without HA, but different version of corosync, and that time, I was really seeing quorum split in the corosync logs of the servers. 


I'll try to reproduce with a virtual cluster with 14 nodes (don't have enough hardware) 


Could I be a bug in proxmox HA code, where watchdog is not resetted by LRM anymore? 

----- Mail original ----- 
De: "dietmar" <dietmar@proxmox.com> 
À: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> 
Envoyé: Dimanche 6 Septembre 2020 06:21:55 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

> >>So you are using ipmi hardware watchdog? 
> 
> yes, I'm using dell idrac ipmi card watchdog 

But the pve logs look ok, and there is no indication 
that we stopped updating the watchdog. So why did the 
watchdog trigger? Maybe an IPMI bug? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06  5:36             ` Alexandre DERUMIER
  2020-09-06  6:33               ` Alexandre DERUMIER
@ 2020-09-06  8:43               ` Alexandre DERUMIER
  2020-09-06 12:14                 ` dietmar
  1 sibling, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-06  8:43 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: dietmar, pve-devel

Maybe something interesting: the only node that survived was node7, and it was the CRM master.

I'm also seeing the CRM disabling the watchdog, and also some "loop take too long" messages.



(some migration logs from node2 to node1 before the maintenance)
Sep  3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:992': state changed from 'migrate' to 'started'  (node = m6kvm1)
Sep  3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:993': state changed from 'migrate' to 'started'  (node = m6kvm1)
Sep  3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:997': state changed from 'migrate' to 'started'  (node = m6kvm1)
....

Sep  3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown'
Sep  3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied
Sep  3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled)
Sep  3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum



other nodes' timing
--------------------

10:39:16 -> node2 shutdown, leaves corosync

10:40:25 -> other nodes rebooted by watchdog


----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "dietmar" <dietmar@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Dimanche 6 Septembre 2020 07:36:10
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>But the pve logs look ok, and there is no indication 
>>that we stopped updating the watchdog. So why did the 
>>watchdog trigger? Maybe an IPMI bug? 

do you mean an ipmi bug on all 13 servers at the same time ? 
(I also have 2 supermicro servers in this cluster, but they use same ipmi watchdog driver. (ipmi_watchdog) 



I had same kind of with bug once (when stopping a server), on another cluster, 6 months ago. 
This was without HA, but different version of corosync, and that time, I was really seeing quorum split in the corosync logs of the servers. 


I'll try to reproduce with a virtual cluster with 14 nodes (don't have enough hardware) 


Could I be a bug in proxmox HA code, where watchdog is not resetted by LRM anymore? 

----- Mail original ----- 
De: "dietmar" <dietmar@proxmox.com> 
À: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> 
Envoyé: Dimanche 6 Septembre 2020 06:21:55 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

> >>So you are using ipmi hardware watchdog? 
> 
> yes, I'm using dell idrac ipmi card watchdog 

But the pve logs look ok, and there is no indication 
that we stopped updating the watchdog. So why did the 
watchdog trigger? Maybe an IPMI bug? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06  8:43               ` Alexandre DERUMIER
@ 2020-09-06 12:14                 ` dietmar
  2020-09-06 12:19                   ` dietmar
  2020-09-07  7:19                   ` Alexandre DERUMIER
  0 siblings, 2 replies; 84+ messages in thread
From: dietmar @ 2020-09-06 12:14 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion; +Cc: pve-devel

> Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)

Indeed, this should not happen. Do you use a separate network for corosync? Or
was there high traffic on the network? What kind of maintenance was the reason
for the shutdown?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06 12:14                 ` dietmar
@ 2020-09-06 12:19                   ` dietmar
  2020-09-07  7:00                     ` Thomas Lamprecht
  2020-09-07  7:19                   ` Alexandre DERUMIER
  1 sibling, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-06 12:19 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion; +Cc: pve-devel

> On 09/06/2020 2:14 PM dietmar <dietmar@proxmox.com> wrote:
> 
>  
> > Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
> 
> Indeed, this should not happen. Do you use a spearate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?

Do you use the default corosync timeout values, or do you have a special setup?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06 12:19                   ` dietmar
@ 2020-09-07  7:00                     ` Thomas Lamprecht
  0 siblings, 0 replies; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-07  7:00 UTC (permalink / raw)
  To: Proxmox VE development discussion, dietmar, Alexandre DERUMIER

On 06.09.20 14:19, dietmar wrote:
>> On 09/06/2020 2:14 PM dietmar <dietmar@proxmox.com> wrote:
>>
>>  
>>> Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
>>> Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>>
>> Indeed, this should not happen. Do you use a spearate network for corosync? Or
>> was there high traffic on the network? What kind of maintenance was the reason
>> for the shutdown?
> 
> Do you use the default corosync timeout values, or do you have a special setup?
> 


Can you please post the full corosync config?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-06 12:14                 ` dietmar
  2020-09-06 12:19                   ` dietmar
@ 2020-09-07  7:19                   ` Alexandre DERUMIER
  2020-09-07  8:18                     ` dietmar
  1 sibling, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-07  7:19 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion, pve-devel

>>Indeed, this should not happen. Do you use a spearate network for corosync? 

No, I use a 2x40GB LACP link.

>>was there high traffic on the network? 

But I'm far from saturating them (in pps or throughput); I'm around 3-4 Gbps.


The cluster has 14 nodes, with around 1000 VMs (with HA enabled on all VMs).


From my understanding, watchdog-mux was still running, as the watchdog only reset after 1 min and not 10 s,
so it looks like the LRM was blocked and no longer sending watchdog timer resets to watchdog-mux.


I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.



>>What kind of maintenance was the reason for the shutdown?

A RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
(I had just shut the server down and had not started it again yet when the problem occurred.)



>>Do you use the default corosync timeout values, or do you have a special setup?


No special tuning, default values. (I haven't had any retransmits in the logs for months.)

>>Can you please post the full corosync config?

(I have verified: the running version was corosync 3.0.3 with libknet 1.15.)


Here is the config:

"
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: m6kvm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: m6kvm1
  }
  node {
    name: m6kvm10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: m6kvm10
  }
  node {
    name: m6kvm11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: m6kvm11
  }
  node {
    name: m6kvm12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: m6kvm12
  }
  node {
    name: m6kvm13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: m6kvm13
  }
  node {
    name: m6kvm14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: m6kvm14
  }
  node {
    name: m6kvm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: m6kvm2
  }
  node {
    name: m6kvm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: m6kvm3
  }
  node {
    name: m6kvm4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: m6kvm4
  }
  node {
    name: m6kvm5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: m6kvm5
  }
  node {
    name: m6kvm6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: m6kvm6
  }
  node {
    name: m6kvm7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: m6kvm7
  }

  node {
    name: m6kvm8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: m6kvm8
  }
  node {
    name: m6kvm9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: m6kvm9
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: m6kvm
  config_version: 19
  interface {
    bindnetaddr: 10.3.94.89
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: knet
  version: 2
}



----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "pve-devel" <pve-devel@pve.proxmox.com>
Envoyé: Dimanche 6 Septembre 2020 14:14:06
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 

Indeed, this should not happen. Do you use a spearate network for corosync? Or 
was there high traffic on the network? What kind of maintenance was the reason 
for the shutdown? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-07  7:19                   ` Alexandre DERUMIER
@ 2020-09-07  8:18                     ` dietmar
  2020-09-07  9:32                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-07  8:18 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion

There is a similar report in the forum:

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...


> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote:
> 
>  
> >>Indeed, this should not happen. Do you use a spearate network for corosync? 
> 
> No, I use 2x40GB lacp link. 
> 
> >>was there high traffic on the network? 
> 
> but I'm far from saturated them. (in pps or througput),  (I'm around 3-4gbps)
> 
> 
> The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms)
> 
> 
> From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s,
>  so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux.
> 
> 
> I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug.
> 
> 
> 
> >>What kind of maintenance was the reason for the shutdown?
> 
> ram upgrade. (the server was running ok before shutdown, no hardware problem)  
> (I just shutdown the server, and don't have started it yet when problem occur)
> 
> 
> 
> >>Do you use the default corosync timeout values, or do you have a special setup?
> 
> 
> no special tuning, default values. (I don't have any retransmit since months in the logs)
> 
> >>Can you please post the full corosync config?
> 
> (I have verified, the running version was corosync was 3.0.3 with libknet 1.15)
> 
> 
> here the config:
> 
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
> 
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
> 
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
> 
> quorum {
>   provider: corosync_votequorum
> }
> 
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
> 
> 
> 
> ----- Mail original -----
> De: "dietmar" <dietmar@proxmox.com>
> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
> Cc: "pve-devel" <pve-devel@pve.proxmox.com>
> Envoyé: Dimanche 6 Septembre 2020 14:14:06
> Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
> 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> 
> Indeed, this should not happen. Do you use a spearate network for corosync? Or 
> was there high traffic on the network? What kind of maintenance was the reason 
> for the shutdown?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-07  8:18                     ` dietmar
@ 2020-09-07  9:32                       ` Alexandre DERUMIER
  2020-09-07 13:23                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-07  9:32 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 
>>
>>No HA involved... 

I had already helped this user some weeks ago:

https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093

HA was active at this time. (Maybe the watchdog was still running; I'm not sure whether the LRM disables the watchdog if you disable HA on all VMs?)


----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Lundi 7 Septembre 2020 10:18:42
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum: 

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 

No HA involved... 


> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote: 
> 
> 
> >>Indeed, this should not happen. Do you use a spearate network for corosync? 
> 
> No, I use 2x40GB lacp link. 
> 
> >>was there high traffic on the network? 
> 
> but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) 
> 
> 
> The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) 
> 
> 
> From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, 
> so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. 
> 
> 
> I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. 
> 
> 
> 
> >>What kind of maintenance was the reason for the shutdown? 
> 
> ram upgrade. (the server was running ok before shutdown, no hardware problem) 
> (I just shutdown the server, and don't have started it yet when problem occur) 
> 
> 
> 
> >>Do you use the default corosync timeout values, or do you have a special setup? 
> 
> 
> no special tuning, default values. (I don't have any retransmit since months in the logs) 
> 
> >>Can you please post the full corosync config? 
> 
> (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) 
> 
> 
> here the config: 
> 
> " 
> logging { 
> debug: off 
> to_syslog: yes 
> } 
> 
> nodelist { 
> node { 
> name: m6kvm1 
> nodeid: 1 
> quorum_votes: 1 
> ring0_addr: m6kvm1 
> } 
> node { 
> name: m6kvm10 
> nodeid: 10 
> quorum_votes: 1 
> ring0_addr: m6kvm10 
> } 
> node { 
> name: m6kvm11 
> nodeid: 11 
> quorum_votes: 1 
> ring0_addr: m6kvm11 
> } 
> node { 
> name: m6kvm12 
> nodeid: 12 
> quorum_votes: 1 
> ring0_addr: m6kvm12 
> } 
> node { 
> name: m6kvm13 
> nodeid: 13 
> quorum_votes: 1 
> ring0_addr: m6kvm13 
> } 
> node { 
> name: m6kvm14 
> nodeid: 14 
> quorum_votes: 1 
> ring0_addr: m6kvm14 
> } 
> node { 
> name: m6kvm2 
> nodeid: 2 
> quorum_votes: 1 
> ring0_addr: m6kvm2 
> } 
> node { 
> name: m6kvm3 
> nodeid: 3 
> quorum_votes: 1 
> ring0_addr: m6kvm3 
> } 
> node { 
> name: m6kvm4 
> nodeid: 4 
> quorum_votes: 1 
> ring0_addr: m6kvm4 
> } 
> node { 
> name: m6kvm5 
> nodeid: 5 
> quorum_votes: 1 
> ring0_addr: m6kvm5 
> } 
> node { 
> name: m6kvm6 
> nodeid: 6 
> quorum_votes: 1 
> ring0_addr: m6kvm6 
> } 
> node { 
> name: m6kvm7 
> nodeid: 7 
> quorum_votes: 1 
> ring0_addr: m6kvm7 
> } 
> 
> node { 
> name: m6kvm8 
> nodeid: 8 
> quorum_votes: 1 
> ring0_addr: m6kvm8 
> } 
> node { 
> name: m6kvm9 
> nodeid: 9 
> quorum_votes: 1 
> ring0_addr: m6kvm9 
> } 
> } 
> 
> quorum { 
> provider: corosync_votequorum 
> } 
> 
> totem { 
> cluster_name: m6kvm 
> config_version: 19 
> interface { 
> bindnetaddr: 10.3.94.89 
> ringnumber: 0 
> } 
> ip_version: ipv4 
> secauth: on 
> transport: knet 
> version: 2 
> } 
> 
> 
> 
> ----- Mail original ----- 
> De: "dietmar" <dietmar@proxmox.com> 
> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
> Cc: "pve-devel" <pve-devel@pve.proxmox.com> 
> Envoyé: Dimanche 6 Septembre 2020 14:14:06 
> Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 
> 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> 
> Indeed, this should not happen. Do you use a spearate network for corosync? Or 
> was there high traffic on the network? What kind of maintenance was the reason 
> for the shutdown? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-07  9:32                       ` Alexandre DERUMIER
@ 2020-09-07 13:23                         ` Alexandre DERUMIER
  2020-09-08  4:41                           ` dietmar
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-07 13:23 UTC (permalink / raw)
  To: Proxmox VE development discussion

Looking at these logs:

Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied

in PVE/HA/Env/PVE2.pm
"
    my $ctime = time();
    my $last_lock_time = $last->{lock_time} // 0;
    my $last_got_lock = $last->{got_lock};

    my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

    eval {

        mkdir $lockdir;

        # pve cluster filesystem not online
        die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

        if (($ctime - $last_lock_time) < $retry_timeout) {
            # try cfs lock update request (utime)
            if (utime(0, $ctime, $filename))  {
                $got_lock = 1;
                return;
            }
            die "cfs lock update failed - $!\n";
        }
"


If the retry_timeout is 120, could that explain why I don't have logs on the other nodes, if the watchdog triggers after 60s?

I don't know much about how locks work in pmxcfs, but when a corosync member leaves or joins and a new cluster membership is formed,
could we end up with some locks lost or hanging?



----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "dietmar" <dietmar@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Lundi 7 Septembre 2020 11:32:13
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 
>> 
>>No HA involved... 

I had already help this user some week ago 

https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093 

HA was actived at this time. (Maybe the watchdog was still running, I'm not sure if you disable HA from all vms if LRM disable the watchdog ?) 


----- Mail original ----- 
De: "dietmar" <dietmar@proxmox.com> 
À: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Lundi 7 Septembre 2020 10:18:42 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

There is a similar report in the forum: 

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 

No HA involved... 


> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote: 
> 
> 
> >>Indeed, this should not happen. Do you use a spearate network for corosync? 
> 
> No, I use 2x40GB lacp link. 
> 
> >>was there high traffic on the network? 
> 
> but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) 
> 
> 
> The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) 
> 
> 
> From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, 
> so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. 
> 
> 
> I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. 
> 
> 
> 
> >>What kind of maintenance was the reason for the shutdown? 
> 
> ram upgrade. (the server was running ok before shutdown, no hardware problem) 
> (I just shutdown the server, and don't have started it yet when problem occur) 
> 
> 
> 
> >>Do you use the default corosync timeout values, or do you have a special setup? 
> 
> 
> no special tuning, default values. (I don't have any retransmit since months in the logs) 
> 
> >>Can you please post the full corosync config? 
> 
> (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) 
> 
> 
> here the config: 
> 
> " 
> logging { 
> debug: off 
> to_syslog: yes 
> } 
> 
> nodelist { 
> node { 
> name: m6kvm1 
> nodeid: 1 
> quorum_votes: 1 
> ring0_addr: m6kvm1 
> } 
> node { 
> name: m6kvm10 
> nodeid: 10 
> quorum_votes: 1 
> ring0_addr: m6kvm10 
> } 
> node { 
> name: m6kvm11 
> nodeid: 11 
> quorum_votes: 1 
> ring0_addr: m6kvm11 
> } 
> node { 
> name: m6kvm12 
> nodeid: 12 
> quorum_votes: 1 
> ring0_addr: m6kvm12 
> } 
> node { 
> name: m6kvm13 
> nodeid: 13 
> quorum_votes: 1 
> ring0_addr: m6kvm13 
> } 
> node { 
> name: m6kvm14 
> nodeid: 14 
> quorum_votes: 1 
> ring0_addr: m6kvm14 
> } 
> node { 
> name: m6kvm2 
> nodeid: 2 
> quorum_votes: 1 
> ring0_addr: m6kvm2 
> } 
> node { 
> name: m6kvm3 
> nodeid: 3 
> quorum_votes: 1 
> ring0_addr: m6kvm3 
> } 
> node { 
> name: m6kvm4 
> nodeid: 4 
> quorum_votes: 1 
> ring0_addr: m6kvm4 
> } 
> node { 
> name: m6kvm5 
> nodeid: 5 
> quorum_votes: 1 
> ring0_addr: m6kvm5 
> } 
> node { 
> name: m6kvm6 
> nodeid: 6 
> quorum_votes: 1 
> ring0_addr: m6kvm6 
> } 
> node { 
> name: m6kvm7 
> nodeid: 7 
> quorum_votes: 1 
> ring0_addr: m6kvm7 
> } 
> 
> node { 
> name: m6kvm8 
> nodeid: 8 
> quorum_votes: 1 
> ring0_addr: m6kvm8 
> } 
> node { 
> name: m6kvm9 
> nodeid: 9 
> quorum_votes: 1 
> ring0_addr: m6kvm9 
> } 
> } 
> 
> quorum { 
> provider: corosync_votequorum 
> } 
> 
> totem { 
> cluster_name: m6kvm 
> config_version: 19 
> interface { 
> bindnetaddr: 10.3.94.89 
> ringnumber: 0 
> } 
> ip_version: ipv4 
> secauth: on 
> transport: knet 
> version: 2 
> } 
> 
> 
> 
> ----- Mail original ----- 
> De: "dietmar" <dietmar@proxmox.com> 
> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
> Cc: "pve-devel" <pve-devel@pve.proxmox.com> 
> Envoyé: Dimanche 6 Septembre 2020 14:14:06 
> Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 
> 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> 
> Indeed, this should not happen. Do you use a spearate network for corosync? Or 
> was there high traffic on the network? What kind of maintenance was the reason 
> for the shutdown? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-07 13:23                         ` Alexandre DERUMIER
@ 2020-09-08  4:41                           ` dietmar
  2020-09-08  7:11                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-08  4:41 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion

> I don't known too much how locks are working in pmxcfs, but when a corosync member leave or join, and a new cluster memership is formed, could we have some lock lost or hang ?

It would really help if we could reproduce the bug somehow. Do you have any idea how
to trigger the bug?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-08  4:41                           ` dietmar
@ 2020-09-08  7:11                             ` Alexandre DERUMIER
  2020-09-09 20:05                               ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-08  7:11 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion

>>It would really help if we can reproduce the bug somehow. Do you have and idea how
>>to trigger the bug?

I really don't know. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting a node.


Maybe it's related to the number of VMs, or the number of nodes; I don't have any clue ...






----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 8 Septembre 2020 06:41:10
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> I don't known too much how locks are working in pmxcfs, but when a corosync member leave or join, and a new cluster memership is formed, could we have some lock lost or hang ? 

It would really help if we can reproduce the bug somehow. Do you have and idea how 
to trigger the bug? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-08  7:11                             ` Alexandre DERUMIER
@ 2020-09-09 20:05                               ` Thomas Lamprecht
  2020-09-10  4:58                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-09 20:05 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER, dietmar

On 08.09.20 09:11, Alexandre DERUMIER wrote:
>>> It would really help if we can reproduce the bug somehow. Do you have and idea how
>>> to trigger the bug?
> 
> I really don't known. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting node.
> 
> 
> Maybe it's related with the number of vms, or the number of nodes, don't have any clue ...

I checked the watchdog code a bit, our user-space mux one and the kernel drivers,
and am just noting a few things here (thinking out loud):

The /dev/watchdog itself is always active, else we could lose it to some
other program and not be able to activate HA dynamically.
But, as long as no HA service got active, it's a simple dummy "wake up every
second and do an ioctl keep-alive update".
This is really simple and efficiently written, so if that fails for over 10s
the system is really loaded, probably barely responding to anything.

Currently the watchdog-mux runs as a normal process, no re-nice, no real-time
scheduling. This is IMO wrong, as it is a critical process which needs to be
run with high priority. I have a patch here which sets it to the highest RR
realtime-scheduling priority available, effectively the same as what corosync does.


diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 818ae00..71981d7 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -8,2 +8,3 @@
 #include <time.h>
+#include <sched.h>
 #include <sys/ioctl.h>
@@ -151,2 +177,15 @@ main(void)
 
+    int sched_priority = sched_get_priority_max (SCHED_RR);
+    if (sched_priority != -1) {
+        struct sched_param global_sched_param;
+        global_sched_param.sched_priority = sched_priority;
+        int res = sched_setscheduler (0, SCHED_RR, &global_sched_param);
+        if (res == -1) {
+            fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority);
+        } else {
+            fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority);
+        }
+    }
+
+
     if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) {

The issue of a watchdog reset with no HA, due to a massively overloaded system,
should already be largely avoided with the scheduling change alone.

Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active.
This *could* come from a side-effect like ceph rebalancing kicking off and producing
a load spike for >10s, hindering the scheduling of the watchdog-mux.
This is a theory, but with HA off it needs to be something like that, as in the HA-off
case there's *no* direct or indirect connection between corosync/pmxcfs and the
watchdog-mux. It simply does not care about, or notice, quorum partition changes at all.


There may be an approach to reserve the watchdog for the mux, but avoid having it
as a "ticking time bomb":
Theoretically one could open it, then disable it with an ioctl (it can be queried
whether a driver supports that) and only enable it for real once the first client connects
to the MUX. This may not work for all watchdog modules, and if so, we may want to make
it configurable, as some people actually want a reset if a (future) real-time process
cannot be scheduled for >= 10 seconds.
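
(Just to sketch that idea against the kernel watchdog API, assuming the driver accepts WDIOS_DISABLECARD, which not all do; this is only an illustration, not a proposed watchdog-mux change, and it should not be run on a production node.)

/*
 * watchdog-disarm-sketch.c - illustration of the "open but keep disarmed" idea.
 * WARNING: touching /dev/watchdog can fence the machine, do not run on a
 * production node.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("open /dev/watchdog");
        return 1;
    }

    struct watchdog_info info;
    if (ioctl(fd, WDIOC_GETSUPPORT, &info) == 0)
        fprintf(stderr, "driver: %s (options 0x%x)\n",
                (const char *)info.identity, info.options);

    /* reserve the device, but keep it disarmed until the first HA client shows up */
    int flags = WDIOS_DISABLECARD;
    if (ioctl(fd, WDIOC_SETOPTIONS, &flags) == -1)
        perror("driver does not support disabling, device keeps ticking");

    /* ... later, once the first client connects to the mux ... */
    flags = WDIOS_ENABLECARD;
    if (ioctl(fd, WDIOC_SETOPTIONS, &flags) == -1)
        perror("enable failed");

    ioctl(fd, WDIOC_KEEPALIVE, 0);      /* normal keep-alive from here on */

    if (write(fd, "V", 1) != 1)         /* magic close, so this demo does not trigger a reset */
        perror("magic close");
    close(fd);
    return 0;
}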

With HA active, there could well be something off, either in corosync/knet or
in how we interface with it in pmxcfs, but that won't explain the
non-HA issues.

Speaking of pmxcfs, that one also runs with standard priority; we may want to change
that to an RT scheduler too, so that it is ensured it can process all corosync events.

I also have a few other small watchdog-mux patches around: it should nowadays actually
be able to tell us why a reset happened (it can also be over/under voltage, temperature,
...), and I'll retry the keep-alive ioctl a few times if it fails; we can only
win with that after all.
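
(A minimal sketch of such a retried keep-alive; the function name and retry count here are made up, this is not the actual patch.)

/* hypothetical retry wrapper around the keep-alive ioctl (not the actual patch) */
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int watchdog_keepalive_retry(int watchdog_fd)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0) == 0)
            return 0;           /* update succeeded */
    }
    return -1;                  /* all attempts failed, caller should log loudly */
}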





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-09 20:05                               ` Thomas Lamprecht
@ 2020-09-10  4:58                                 ` Alexandre DERUMIER
  2020-09-10  8:21                                   ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-10  4:58 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar

Thanks Thomas for the investigations.

I'm still trying to reproduce...
I think I have some special case here, because the forum user with 30 nodes had a corosync cluster split. (Note that I had this bug 6 months ago, also when shutting down a node, and the only way to recover was to fully stop corosync on all nodes and start corosync again on all nodes.)


But this time, the corosync logs look fine (every node correctly sees node2 down and sees the remaining nodes).

The surviving node7 was the only node with HA, and the LRM didn't have the watchdog enabled (I haven't found any log like "pve-ha-lrm: watchdog active" for the last 6 months on this node).


So, the timing was:

10:39:05 : "halt" command is send to node2
10:39:16 : node2 is leaving corosync / halt  -> every node is seeing it and correctly do a new membership with 13 remaining nodes

...don't see any special logs (corosync,pmxcfs,pve-ha-crm,pve-ha-lrm) after the node2 leaving.
But they are still activity on the server, pve-firewall is still logging, vms are running fine


between 10:40:25 - 10:40:34 : watchdog reset nodes, but not node7.

-> so between 70s-80s after the node2 was done, so I think that watchdog-mux was still running fine until that.
   (That's sound like lrm was stuck, and client_watchdog_timeout have expired in watchdog-mux) 



10:40:41 : node7 loses quorum (as all other nodes have reset).



10:40:50 : node7's crm/lrm finally log:

Sep  3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied



So, I really think that something stalled the lrm/crm loop, and the watchdog was not reset because of that.





----- Mail original -----
De: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>, "dietmar" <dietmar@proxmox.com>
Envoyé: Mercredi 9 Septembre 2020 22:05:49
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 08.09.20 09:11, Alexandre DERUMIER wrote: 
>>> It would really help if we can reproduce the bug somehow. Do you have and idea how 
>>> to trigger the bug? 
> 
> I really don't known. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting node. 
> 
> 
> Maybe it's related with the number of vms, or the number of nodes, don't have any clue ... 

I checked a bit the watchdog code, our user-space mux one and the kernel drivers, 
and just noting a few things here (thinking out aloud): 

The /dev/watchdog itself is always active, else we could loose it to some 
other program and not be able to activate HA dynamically. 
But, as long as no HA service got active, it's a simple dummy "wake up every 
second and do an ioctl keep-alive update". 
This is really simple and efficiently written, so if that fails for over 10s 
the systems is really loaded, probably barely responding to anything. 

Currently the watchdog-mux runs as normal process, no re-nice, no real-time 
scheduling. This is IMO wrong, as it is a critical process which needs to be 
run with high priority. I've a patch here which sets it to the highest RR 
realtime-scheduling priority available, effectively the same what corosync does. 


diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c 
index 818ae00..71981d7 100644 
--- a/src/watchdog-mux.c 
+++ b/src/watchdog-mux.c 
@@ -8,2 +8,3 @@ 
#include <time.h> 
+#include <sched.h> 
#include <sys/ioctl.h> 
@@ -151,2 +177,15 @@ main(void) 

+ int sched_priority = sched_get_priority_max (SCHED_RR); 
+ if (sched_priority != -1) { 
+ struct sched_param global_sched_param; 
+ global_sched_param.sched_priority = sched_priority; 
+ int res = sched_setscheduler (0, SCHED_RR, &global_sched_param); 
+ if (res == -1) { 
+ fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority); 
+ } else { 
+ fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority); 
+ } 
+ } 
+ 
+ 
if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) { 

The issue with no HA but a watchdog reset due to a massively overloaded system
should already be largely avoided with the scheduling change alone.

Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active. 
This *could* come from a side-effect like ceph rebalancing kicking off and producing
a load spike for >10s, hindering the scheduling of the watchdog-mux. 
This is a theory, but with HA off it needs to be something like that, as in HA-off 
case there's *no* direct or indirect connection between corosync/pmxcfs and the 
watchdog-mux. It simply does not care about, or notice, quorum partition changes at all.


There may be an approach to reserve the watchdog for the mux, but avoid having it
as a "ticking time bomb":
Theoretically one could open it, then disable it with an ioctl (it can be queried
whether a driver supports that) and only enable it for real once the first client connects
to the MUX. This may not work for all watchdog modules, and if it does, we may want to make
it configurable, as some people actually want a reset if a (future) real-time process
cannot be scheduled for >= 10 seconds.
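
A rough sketch of that idea, using only the standard Linux watchdog ioctl API
(illustrative only - this is not existing watchdog-mux code, and whether
WDIOS_DISABLECARD actually works depends on the driver):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* enable/disable the watchdog card; drivers that do not support this
 * simply fail the ioctl, in which case the dummy keep-alive must stay */
static int set_card(int fd, int enable)
{
    int flags = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;
    return ioctl(fd, WDIOC_SETOPTIONS, &flags);
}

int main(void)
{
    struct watchdog_info info;
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1)
        return 1;

    /* query what the driver claims to support */
    if (ioctl(fd, WDIOC_GETSUPPORT, &info) == 0)
        printf("driver: %s, options: 0x%x\n", (char *) info.identity, info.options);

    if (set_card(fd, 0) == -1)
        fprintf(stderr, "driver cannot disable the watchdog\n");

    /* ... only once the first HA client connects to the mux, arm it for real ... */
    set_card(fd, 1);
    return 0;
}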

With HA active, well then there could be something off, either in corosync/knet or 
also in how we interface with it in pmxcfs, that could well be, but won't explain the 
non-HA issues. 

Speaking of pmxcfs, that one also runs with standard priority; we may want to change
that too to an RT scheduler, so that it's ensured it can process all corosync events.

I also have a few other small watchdog-mux patches around; it should nowadays actually
be able to tell us why a reset happened (can also be over/under voltage, temperature,
...) and I'll retry the keep-alive ioctl a few times if it fails, we can only
win with that after all.
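
For the reset-reason part, the kernel already exposes that via WDIOC_GETBOOTSTATUS;
a minimal sketch (which flags get reported varies by driver, and this is not an
actual patch):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* log why the last reboot happened, if the driver can tell us */
static void log_boot_status(int watchdog_fd)
{
    int flags = 0;

    if (ioctl(watchdog_fd, WDIOC_GETBOOTSTATUS, &flags) != 0)
        return;
    if (flags & WDIOF_CARDRESET)
        fprintf(stderr, "last reboot was caused by the watchdog\n");
    if (flags & WDIOF_OVERHEAT)
        fprintf(stderr, "last reboot was caused by a CPU overheat condition\n");
    if (flags & WDIOF_POWERUNDER)
        fprintf(stderr, "last reboot was caused by under-voltage\n");
}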




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-10  4:58                                 ` Alexandre DERUMIER
@ 2020-09-10  8:21                                   ` Thomas Lamprecht
  2020-09-10 11:34                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-10  8:21 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER

On 10.09.20 06:58, Alexandre DERUMIER wrote:
> Thanks Thomas for the investigations.
> 
> I'm still trying to reproduce...
> I think I have some special case here, because the user of the forum with 30 nodes had corosync cluster split. (Note that I had this bug 6 months ago,when shuting down a node too, and the only way was stop full stop corosync on all nodes, and start corosync again on all nodes).
> 
> 
> But this time, corosync logs looks fine. (every node, correctly see node2 down, and see remaning nodes)
> 
> surviving node7, was the only node with HA, and LRM didn't have enable watchog (I don't have found any log like "pve-ha-lrm: watchdog active" for the last 6months on this nodes
> 
> 
> So, the timing was:
> 
> 10:39:05 : "halt" command is send to node2
> 10:39:16 : node2 is leaving corosync / halt  -> every node is seeing it and correctly do a new membership with 13 remaining nodes
> 
> ...don't see any special logs (corosync,pmxcfs,pve-ha-crm,pve-ha-lrm) after the node2 leaving.
> But they are still activity on the server, pve-firewall is still logging, vms are running fine
> 
> 
> between 10:40:25 - 10:40:34 : watchdog reset nodes, but not node7.
> 
> -> so between 70s-80s after the node2 was done, so I think that watchdog-mux was still running fine until that.
>    (That's sound like lrm was stuck, and client_watchdog_timeout have expired in watchdog-mux)

as said, if the other nodes were not using HA, the watchdog-mux had no
client which could expire.

> 
> 10:40:41 node7, loose quorum (as all others nodes have reset),

> 10:40:50: node7 crm/lrm finally log.
> 
> Sep  3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
> Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)

above lines also indicate very high load.

Do you have some monitoring which shows the CPU/IO load before/during this event?

> Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
> Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied
> 
> 
> 
> So, I really think that something have stucked lrm/crm loop, and watchdog was not resetted because of that.
> 





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-10  8:21                                   ` Thomas Lamprecht
@ 2020-09-10 11:34                                     ` Alexandre DERUMIER
  2020-09-10 18:21                                       ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-10 11:34 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

>>as said, if the other nodes where not using HA, the watchdog-mux had no
>>client which could expire.

sorry, maybe I explained it wrongly,
but all my nodes had HA enabled.

I have double-checked the lrm_status JSON files from my morning backup, 2h before the problem;
they were all in the "active" state. ("state":"active","mode":"active" )

I don't know why node7 didn't reboot; the only difference is that it was the crm master.
(I think the crm also resets the watchdog counter? Maybe the behaviour is different from the lrm?)




>>above lines also indicate very high load. 
>>Do you have some monitoring which shows the CPU/IO load before/during this event? 

load (1,5,15) was: 6 (for 48 cores), cpu usage: 23%
no iowait on disk (VMs are on a remote ceph; only proxmox services are running on the local SSD disk)

so nothing strange here :/



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-10 11:34                                     ` Alexandre DERUMIER
@ 2020-09-10 18:21                                       ` Thomas Lamprecht
  2020-09-14  4:54                                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-10 18:21 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER

On 10.09.20 13:34, Alexandre DERUMIER wrote:
>>> as said, if the other nodes where not using HA, the watchdog-mux had no
>>> client which could expire.
> 
> sorry, maybe I have wrong explained it,
> but all my nodes had HA enabled.
> 
> I have double check lrm_status json files from my morning backup 2h before the problem,
> they were all in "active" state. ("state":"active","mode":"active" )
> 

OK, so all had a connection to the watchdog-mux open. This shifts the suspicion
again over to pmxcfs and/or corosync.

> I don't why node7 don't have rebooted, the only difference is that is was the crm master.
> (I think crm also reset the watchdog counter ? maybe behaviour is different than lrm ?)

The watchdog-mux stops updating the real watchdog as soon as any client disconnects or times
out. It does not know which client (daemon) that was.
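
To make that behaviour concrete, here is a simplified sketch of that decision
(illustrative only, with made-up names and structure - not the actual watchdog-mux
source):

#include <time.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

#define CLIENT_TIMEOUT 60   /* roughly the client_watchdog_timeout discussed above */
#define MAX_CLIENTS    10

/* one entry per connected HA daemon (crm/lrm) */
struct client { int active; time_t last_keepalive; };
static struct client clients[MAX_CLIENTS];

/* called about once per second; watchdog_fd is the open /dev/watchdog */
static void update_watchdog(int watchdog_fd)
{
    time_t now = time(NULL);

    for (int i = 0; i < MAX_CLIENTS; i++) {
        /* a connected client that stopped sending keep-alives... */
        if (clients[i].active && now - clients[i].last_keepalive > CLIENT_TIMEOUT)
            return; /* ...means we stop petting the device, so the node will reset */
    }
    /* all connected clients are healthy -> keep the node alive */
    ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0);
}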

>>> above lines also indicate very high load. 
>>> Do you have some monitoring which shows the CPU/IO load before/during this event? 
> 
> load (1,5,15 ) was: 6  (for 48cores), cpu usage: 23%
> no iowait on disk (vms are on a remote ceph, only proxmox services are running on local ssd disk)
> 
> so nothing strange here :/

Hmm, the long loop times could then be the effect of a pmxcfs read or write
operation being (temporarily) stuck.





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-10 18:21                                       ` Thomas Lamprecht
@ 2020-09-14  4:54                                         ` Alexandre DERUMIER
  2020-09-14  7:14                                           ` Dietmar Maurer
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-14  4:54 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion


I wonder if something like pacemaker's sbd could be implemented in Proxmox as an extra layer of protection?

http://manpages.ubuntu.com/manpages/bionic/man8/sbd.8.html

(shared disk heartbeat).

Something like an independent daemon (not using corosync/pmxcfs/...), also connected to the watchdog mux.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-14  4:54                                         ` Alexandre DERUMIER
@ 2020-09-14  7:14                                           ` Dietmar Maurer
  2020-09-14  8:27                                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Dietmar Maurer @ 2020-09-14  7:14 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER, Thomas Lamprecht

> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ?

AFAIK Thomas already has patches to implement active fencing.

But IMHO this will not solve the corosync problems..




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-14  7:14                                           ` Dietmar Maurer
@ 2020-09-14  8:27                                             ` Alexandre DERUMIER
  2020-09-14  8:51                                               ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-14  8:27 UTC (permalink / raw)
  To: dietmar; +Cc: Proxmox VE development discussion, Thomas Lamprecht

> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? 

>>AFAIK Thomas already has patches to implement active fencing. 

>>But IMHO this will not solve the corosync problems.. 

Yes, sure. I really want to have 2 different sources of verification, with different paths/software, to avoid this kind of bug.
(shit happens, Murphy's law ;)

as we say in French "ceinture & bretelles" -> "belt and braces"


BTW,
a user has reported a new corosync problem here:
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871
(Sounds like the bug I had 6 months ago, with corosync flooding a lot of UDP packets, but not the same bug I have here)






^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-14  8:27                                             ` Alexandre DERUMIER
@ 2020-09-14  8:51                                               ` Thomas Lamprecht
  2020-09-14 15:45                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-14  8:51 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER, dietmar

On 9/14/20 10:27 AM, Alexandre DERUMIER wrote:
>> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? 
> 
>>> AFAIK Thomas already has patches to implement active fencing. 
> 
>>> But IMHO this will not solve the corosync problems.. 
> 
> Yes, sure. I'm really to have to 2 differents sources of verification, with different path/software, to avoid this kind of bug.
> (shit happens, murphy law ;)

would then need at least three, and if one has a bug flooding the network, then in
a lot of setups (not having beefy switches like you ;) the other two will be
taken down also, as either memory or the system stack gets overloaded.

> 
> as we say in French "ceinture & bretelles" -> "belt and braces"
> 
> 
> BTW,
> a user have reported new corosync problem here:
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871
> (Sound like the bug that I have 6month ago, with corosync bug flooding a lof of udp packets, but not the same bug I have here)

Did you get in contact with knet/corosync devs about this?

Because it may well be something their stack is better at handling; maybe
there's also really still a bug, or bad behaviour in some edge cases...




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-14  8:51                                               ` Thomas Lamprecht
@ 2020-09-14 15:45                                                 ` Alexandre DERUMIER
  2020-09-15  5:45                                                   ` dietmar
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-14 15:45 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar

>>Did you get in contact with knet/corosync devs about this? 
>>Because, it may well be something their stack is better at handling it, maybe 
>>there's also really still a bug, or bad behaviour on some edge cases... 

Not yet, I would like to have more info to submit, because I'm blind here.
I have enabled debug logs on my whole cluster in case it happens again.


BTW,
I have noticed something:

corosync is stopped after syslog stops, so at shutdown we never get corosync logs.


I have edited corosync.service:

- After=network-online.target
+ After=network-online.target syslog.target


and now it's logging correctly.
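
(For reference, the same ordering change could also be applied without touching the
shipped unit file, via a drop-in - a sketch, assuming a standard systemd override:

/etc/systemd/system/corosync.service.d/override.conf
[Unit]
After=syslog.target

since After= is additive, this just appends to the existing ordering.)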



Now that logging works, I'm also seeing pmxcfs errors when corosync is stopping.
(But no pmxcfs shutdown log.)

Do you think it's possible to have a clean shutdown of pmxcfs first, before stopping corosync ?


"
Sep 14 17:23:49 pve corosync[1346]:   [MAIN  ] Node was shut down by a signal
Sep 14 17:23:49 pve systemd[1]: Stopping Corosync Cluster Engine...
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Unloading all Corosync service engines.
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_dispatch failed: 2
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync configuration map access
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync configuration service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: node lost quorum
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync profile loading service
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync watchdog service
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: can't initialize service
Sep 14 17:23:50 pve corosync[1346]:   [MAIN  ] Corosync Cluster Engine exiting normally
"





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-14 15:45                                                 ` Alexandre DERUMIER
@ 2020-09-15  5:45                                                   ` dietmar
  2020-09-15  6:27                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-15  5:45 UTC (permalink / raw)
  To: Alexandre DERUMIER, Thomas Lamprecht; +Cc: Proxmox VE development discussion

> Now, that logging work, I'm also seeeing pmxcfs errors when corosync is stopping.
> (But no pmxcfs shutdown log)
> 
> Do you think it's possible to have a clean shutdown of pmxcfs first, before stopping corosync ?

This is by intention - we do not want to stop pmxcfs only because the corosync service stops.

[Unit]
Description=The Proxmox VE cluster filesystem
ConditionFileIsExecutable=/usr/bin/pmxcfs
Wants=corosync.service
Wants=rrdcached.service
Before=corosync.service
Before=ceph.service
Before=cron.service
After=network.target
After=sys-fs-fuse-connections.mount
After=time-sync.target
After=rrdcached.service
DefaultDependencies=no
Before=shutdown.target
Conflicts=shutdown.target




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  5:45                                                   ` dietmar
@ 2020-09-15  6:27                                                     ` Alexandre DERUMIER
  2020-09-15  7:13                                                       ` dietmar
  2020-09-15  7:58                                                       ` Thomas Lamprecht
  0 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15  6:27 UTC (permalink / raw)
  To: dietmar; +Cc: Thomas Lamprecht, Proxmox VE development discussion

>>This is by intention - we do not want to stop pmxcfs only because coorosync service stops. 

Yes, but at shutdown, wouldn't it be better to stop pmxcfs before corosync?
I ask the question because the 2 times I had the problem, it was when shutting down a server.
So maybe some strange behaviour occurs when both corosync && pmxcfs are stopped at the same time?


looking at the pve-cluster unit file,
why do we have "Before=corosync.service" and not "After=corosync.service" ?

I have tried to change this, but even with that, both are still shutting down in parallel.

the only way I have found to get a clean shutdown is "Requires=corosync.service" + "After=corosync.service".
But that means that if you restart corosync, it restarts pmxcfs first too.

I have looked at the systemd docs; After= should be enough (as at shutdown it applies the reverse order),
but I don't know why corosync doesn't wait for pve-cluster ???


(Also, I think that pmxcfs is also stopping after syslog, because I never see the pmxcfs "teardown filesystem" logs at shutdown.)





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  6:27                                                     ` Alexandre DERUMIER
@ 2020-09-15  7:13                                                       ` dietmar
  2020-09-15  8:42                                                         ` Alexandre DERUMIER
  2020-09-15  7:58                                                       ` Thomas Lamprecht
  1 sibling, 1 reply; 84+ messages in thread
From: dietmar @ 2020-09-15  7:13 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Thomas Lamprecht, Proxmox VE development discussion

> I ask the question, because the 2 times I have problem, it was when shutting down a server.
> So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ?

pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  6:27                                                     ` Alexandre DERUMIER
  2020-09-15  7:13                                                       ` dietmar
@ 2020-09-15  7:58                                                       ` Thomas Lamprecht
  1 sibling, 0 replies; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-15  7:58 UTC (permalink / raw)
  To: Alexandre DERUMIER, dietmar; +Cc: Proxmox VE development discussion

On 9/15/20 8:27 AM, Alexandre DERUMIER wrote:
>>> This is by intention - we do not want to stop pmxcfs only because coorosync service stops. 
> 
> Yes, but at shutdown, it could be great to stop pmxcfs before corosync ?
> I ask the question, because the 2 times I have problem, it was when shutting down a server.
> So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ?
> 
> 
> looking at the pve-cluster unit file,
> why do we have "Before=corosync.service" and not "After=corosync.service" ?

We may need to sync the cluster-wide corosync.conf over to the local one; that can
only happen before.

Also, if we shut down pmxcfs before corosync we may still get corosync events (file writes,
locking, ...) but the node would not see them locally anymore while still looking quorate to
the others; that'd not be good.

> 
> I have tried to change this, but even with that, both are still shutting down in parallel.
> 
> the only way I have found to have clean shutdown, is "Requires=corosync.server" + "After=corosync.service".
> But that mean than if you restart corosync, it's restart pmxcfs too first.
> 
> I have looked at systemd doc, After= should be enough (as at shutdown it's doing the reverse order),
> but I don't known why corosync don't wait than pve-cluster ???
> 
> 
> (Also, I think than pmxcfs is also stopping after syslog, because I never see the pmxcfs "teardown filesystem" logs at shutdown)


is that true for (persistent) systemd-journald too? IIRC syslog.target is
deprecated and only rsyslog provides it.

As the next Debian will enable the persistent journal by default, and we already
use it for everything (IIRC) where we provide an interface to logs, we will
probably not enable rsyslog by default with PVE 7.x

But if we can add some ordering for this to be improved I'm open for it.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  7:13                                                       ` dietmar
@ 2020-09-15  8:42                                                         ` Alexandre DERUMIER
  2020-09-15  9:35                                                           ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15  8:42 UTC (permalink / raw)
  To: dietmar; +Cc: Thomas Lamprecht, Proxmox VE development discussion

>>pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.
Yes, I understand that, but I was thinking of the case where corosync is in its stopping phase (not totally stopped).

Something racy (I really don't know).

I just sent 2 patches to start pve-cluster && corosync after syslog; like this we'll have shutdown logs too.


(I'm currently trying to reproduce the problem with reboot loops, but I still can't reproduce it :/ )







^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  8:42                                                         ` Alexandre DERUMIER
@ 2020-09-15  9:35                                                           ` Alexandre DERUMIER
  2020-09-15  9:46                                                             ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15  9:35 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: dietmar, Thomas Lamprecht

Hi,

I have finally reproduced it!

But this is with a corosync restart from cron every minute, on node1.

Then: the lrm was stuck for too long, for around 60s, and the softdog was triggered on multiple other nodes.


Here are the logs with full corosync debug at the time of the last corosync restart.

node1 (where corosync is restarted each minute)
https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e

node2
https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67

node5

https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273




I'll prepare logs from the previous corosync restart, as the lrm seems to have been stuck already before that.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  9:35                                                           ` Alexandre DERUMIER
@ 2020-09-15  9:46                                                             ` Thomas Lamprecht
  2020-09-15 10:15                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-15  9:46 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote:
> Hi,
> 
> I have finally reproduce it !
> 
> But this is with a corosync restart in cron each 1 minute, on node1
>
> Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes.
> 
> here the logs with full corosync debug at the time of last corosync restart. 
> 
> node1 (where corosync is restarted each minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e
> 
> node2
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67
> 
> node5
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273
> 
> I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before.

Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15  9:46                                                             ` Thomas Lamprecht
@ 2020-09-15 10:15                                                               ` Alexandre DERUMIER
  2020-09-15 11:04                                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 10:15 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar

Here are the previous restart logs.

node1 -> corosync restart at  10:46:15
-----
https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7


node2
-----
https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba



node5
------
https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 10:15                                                               ` Alexandre DERUMIER
@ 2020-09-15 11:04                                                                 ` Alexandre DERUMIER
  2020-09-15 12:49                                                                   ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 11:04 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Also the logs of node14, where the lrm loop was not too long:

https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 11:04                                                                 ` Alexandre DERUMIER
@ 2020-09-15 12:49                                                                   ` Alexandre DERUMIER
  2020-09-15 13:00                                                                     ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 12:49 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Hi,

I have reproduced it again,

now I can't write to /etc/pve/ from any node


I have also added some debug logs to pve-ha-lrm, and it was stuck in:
(but if /etc/pve is locked, this is normal)

        if ($fence_request) {
            $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif (!$self->get_protected_ha_agent_lock()) {
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif ($self->{mode} eq 'maintenance') {
            $self->set_local_status({ state => 'maintenance'});
        }


corosync quorum is currently ok

I'm currently digging the logs



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 12:49                                                                   ` Alexandre DERUMIER
@ 2020-09-15 13:00                                                                     ` Thomas Lamprecht
  2020-09-15 14:09                                                                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-15 13:00 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote:
> Hi,
> 
> I have produce it again, 
> 
> now I can't write to /etc/pve/ from any node
> 

OK, so seems to really be an issue in pmxcfs or between corosync and pmxcfs,
not the HA LRM or watchdog mux itself.

Can you try to give pmxcfs real time scheduling, e.g., by doing:

# systemctl edit pve-cluster

And then add snippet:


[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99


And restart pve-cluster
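
(To check that the override is effective after the restart, something like
"chrt -p $(pidof pmxcfs)" should then report SCHED_RR with the configured priority.)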

> I have also added some debug logs to pve-ha-lrm, and it was stuck in:
> (but if /etc/pve is locked, this is normal)
> 
>         if ($fence_request) {
>             $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif (!$self->get_protected_ha_agent_lock()) {
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif ($self->{mode} eq 'maintenance') {
>             $self->set_local_status({ state => 'maintenance'});
>         }
> 
> 
> corosync quorum is currently ok
> 
> I'm currently digging the logs
Is your simplest/most stable reproducer still a periodic restart of corosync on one node?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 13:00                                                                     ` Thomas Lamprecht
@ 2020-09-15 14:09                                                                       ` Alexandre DERUMIER
  2020-09-15 14:19                                                                         ` Alexandre DERUMIER
  2020-09-15 14:32                                                                         ` Thomas Lamprecht
  0 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 14:09 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

>>
>>Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>
>># systemctl edit pve-cluster 
>>
>>And then add snippet: 
>>
>>
>>[Service] 
>>CPUSchedulingPolicy=rr 
>>CPUSchedulingPriority=99 

yes, sure, I'll do it now


> I'm currently digging the logs 
>>Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 

yes, a simple "systemctl restart corosync" on 1 node each minute



After 1 hour, it's still locked.

On the other nodes, I still have pmxcfs logs like:

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log
...


On node1, I just restarted the pve-cluster service with systemctl restart pve-cluster;
the pmxcfs process was killed but could not be started again,
and after that /etc/pve became writable again on the other nodes.

(I haven't rebooted node1 yet, in case you want more tests on pmxcfs.)



root@m6kvm1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s ago
  Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.

manual "pmxcfs -d"
https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e




Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000064)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000063)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 157
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 31
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000064)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 9 inode updates)
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 25
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000065)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 87
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 6 inode updates)
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 33
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service
Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000067)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000066)
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date
Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log
Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log

----> restart
 2352  [ 15/09/2020 15:52:00 ] systemctl restart pve-cluster


Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration filesystem (-1)
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport endpoint is not connected
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration filesystem (-1)


some interesting dmesg output about "pvesr"

[Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
[Tue Sep 15 14:45:34 2020]       Tainted: P           O      5.4.60-1-pve #1
[Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 15 14:45:34 2020] pvesr           D    0 19038      1 0x00000080
[Tue Sep 15 14:45:34 2020] Call Trace:
[Tue Sep 15 14:45:34 2020]  __schedule+0x2e6/0x6f0
[Tue Sep 15 14:45:34 2020]  ? filename_parentat.isra.57.part.58+0xf7/0x180
[Tue Sep 15 14:45:34 2020]  schedule+0x33/0xa0
[Tue Sep 15 14:45:34 2020]  rwsem_down_write_slowpath+0x2ed/0x4a0
[Tue Sep 15 14:45:34 2020]  down_write+0x3d/0x40
[Tue Sep 15 14:45:34 2020]  filename_create+0x8e/0x180
[Tue Sep 15 14:45:34 2020]  do_mkdirat+0x59/0x110
[Tue Sep 15 14:45:34 2020]  __x64_sys_mkdir+0x1b/0x20
[Tue Sep 15 14:45:34 2020]  do_syscall_64+0x57/0x190
[Tue Sep 15 14:45:34 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
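
(if more of these traces are needed, one option, assuming the magic sysrq interface is available, is to dump every blocked task at once:)

echo 1 > /proc/sys/kernel/sysrq       # enable sysrq if it is not already
echo w > /proc/sysrq-trigger          # dump all tasks in uninterruptible (D) state to dmesg
dmesg | grep -B1 -A20 'blocked for more than'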




----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, 15 September 2020 15:00:03
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I have produce it again, 
> 
> now I can't write to /etc/pve/ from any node 
> 

OK, so seems to really be an issue in pmxcfs or between corosync and pmxcfs, 
not the HA LRM or watchdog mux itself. 

Can you try to give pmxcfs real time scheduling, e.g., by doing: 

# systemctl edit pve-cluster 

And then add snippet: 


[Service] 
CPUSchedulingPolicy=rr 
CPUSchedulingPriority=99 


And restart pve-cluster 
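
(a minimal sketch of applying and checking the override; the file path and the chrt check below are assumptions, not verified here:)

mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/override.conf <<'EOF'
[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99
EOF
systemctl daemon-reload
systemctl restart pve-cluster
chrt -p $(pidof pmxcfs)    # should now report SCHED_RR with priority 99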

> I have also added some debug logs to pve-ha-lrm, and it was stuck in: 
> (but if /etc/pve is locked, this is normal) 
> 
> if ($fence_request) { 
>     $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
>     $self->set_local_status({ state => 'lost_agent_lock'}); 
> } elsif (!$self->get_protected_ha_agent_lock()) { 
>     $self->set_local_status({ state => 'lost_agent_lock'}); 
> } elsif ($self->{mode} eq 'maintenance') { 
>     $self->set_local_status({ state => 'maintenance'}); 
> } 
> 
> 
> corosync quorum is currently ok 
> 
> I'm currently digging the logs 
Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 14:09                                                                       ` Alexandre DERUMIER
@ 2020-09-15 14:19                                                                         ` Alexandre DERUMIER
  2020-09-15 14:32                                                                         ` Thomas Lamprecht
  1 sibling, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 14:19 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

about node1: the /etc/pve directory seems to be in a bad state, that's why it can't be mounted

ls -lah /etc/pve:
  ??????????   ? ?    ?         ?            ? pve

I have forced a lazy umount:
umount -l /etc/pve 

and now it's working fine.

(so maybe when pmxcfs was killed, it didn't cleanly unmount /etc/pve?)
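
(a small guard along those lines, purely illustrative, could be run before starting pmxcfs again:)

# if the old FUSE mount is stale, ls fails with "Transport endpoint is not connected",
# so detach it lazily before restarting the service
if ! ls /etc/pve >/dev/null 2>&1; then
    umount -l /etc/pve
fi
systemctl restart pve-cluster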






^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 14:09                                                                       ` Alexandre DERUMIER
  2020-09-15 14:19                                                                         ` Alexandre DERUMIER
@ 2020-09-15 14:32                                                                         ` Thomas Lamprecht
  2020-09-15 14:57                                                                           ` Alexandre DERUMIER
  1 sibling, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-15 14:32 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote:
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>>
>>> # systemctl edit pve-cluster 
>>>
>>> And then add snippet: 
>>>
>>>
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now
> 
> 
>> I'm currently digging the logs 
>>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute
> 
> 
> 
> After 1hour, it's still locked.
> 
> on other nodes, I still have pmxfs logs like:
> 

I mean this is bad, but also great!
Can you do a coredump of the whole thing and upload it somewhere with the version info
used (for dbgsym package)? That could help a lot.
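
(one possible way to capture that, sketched with the usual gdb tooling; the dbgsym package name is an assumption:)

apt install gdb pve-cluster-dbgsym                 # dbgsym package name assumed
gcore -o /var/tmp/pmxcfs $(pidof pmxcfs)           # writes /var/tmp/pmxcfs.<pid> without stopping the process
pveversion -v > /var/tmp/pmxcfs.versions           # keep the package versions next to the dump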


> manual "pmxcfs -d"
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still
running) but I'd rather say this is a side-effect not directly connected to the real bug.

> 
> some interesting dmesg about "pvesr"
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
> [Tue Sep 15 14:45:34 2020]       Tainted: P           O      5.4.60-1-pve #1
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 15 14:45:34 2020] pvesr           D    0 19038      1 0x00000080
> [Tue Sep 15 14:45:34 2020] Call Trace:
> [Tue Sep 15 14:45:34 2020]  __schedule+0x2e6/0x6f0
> [Tue Sep 15 14:45:34 2020]  ? filename_parentat.isra.57.part.58+0xf7/0x180
> [Tue Sep 15 14:45:34 2020]  schedule+0x33/0xa0
> [Tue Sep 15 14:45:34 2020]  rwsem_down_write_slowpath+0x2ed/0x4a0
> [Tue Sep 15 14:45:34 2020]  down_write+0x3d/0x40
> [Tue Sep 15 14:45:34 2020]  filename_create+0x8e/0x180
> [Tue Sep 15 14:45:34 2020]  do_mkdirat+0x59/0x110
> [Tue Sep 15 14:45:34 2020]  __x64_sys_mkdir+0x1b/0x20
> [Tue Sep 15 14:45:34 2020]  do_syscall_64+0x57/0x190
> [Tue Sep 15 14:45:34 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 

hmm, hangs in mkdir (cluster-wide locking)
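
(easy to confirm by hand while it hangs; any mkdir below the FUSE mount simply blocks, e.g.:)

mkdir /etc/pve/hang-test    # the path is only an example, the call never returns while pmxcfs is stuck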




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 14:32                                                                         ` Thomas Lamprecht
@ 2020-09-15 14:57                                                                           ` Alexandre DERUMIER
  2020-09-15 15:58                                                                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 14:57 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

>>I mean this is bad, but also great! 
>>Can you do a coredump of the whole thing and upload it somewhere with the version info 
>>used (for dbgsym package)? That could help a lot.

I'll try to reproduce it again (with the full lockup everywhere), and do the coredump.




I have tried the real-time scheduling,

but I have still been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think it's unlocking
something at the next corosync restart).


this time it was blocked, at the same time, on one node in:

work {
...
   } elsif ($state eq 'active') {
      ....
        $self->update_lrm_status();


and on another node in:

        if ($fence_request) {
            $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif (!$self->get_protected_ha_agent_lock()) {
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif ($self->{mode} eq 'maintenance') {
            $self->set_local_status({ state => 'maintenance'});
        }









^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 14:57                                                                           ` Alexandre DERUMIER
@ 2020-09-15 15:58                                                                             ` Alexandre DERUMIER
  2020-09-16  7:34                                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-15 15:58 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Another small lockup at 17:41:09.

To be sure, I have run a small loop writing to /etc/pve every second, on node2.

it hangs at the first corosync restart, then, on the second corosync restart, it works again.

I'll try to improve this tomorrow, to be able to debug the corosync process (a sketch of the probe follows this list):
- restart corosync
- do some writes in /etc/pve/
- and if it's hanging, don't restart corosync again
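
(a rough sketch of that probe, assuming a timed-out write is a good enough signal for "hanging":)

while true; do
    systemctl restart corosync
    sleep 5
    if ! timeout 10 sh -c 'echo test > /etc/pve/test'; then
        echo "write to /etc/pve hung, leaving corosync alone for debugging"
        break
    fi
    sleep 55
done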



node2: echo test > /etc/pve/test loop
--------------------------------------
Current time : 17:41:01
Current time : 17:41:02
Current time : 17:41:03
Current time : 17:41:04
Current time : 17:41:05
Current time : 17:41:06
Current time : 17:41:07
Current time : 17:41:08
Current time : 17:41:09

hang

Current time : 17:42:05
Current time : 17:42:06
Current time : 17:42:07



node1
-----
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] pmtud: Starting PMTUD for host: 10 link: 0
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] udp: detected kernel MTU: 1500
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [TOTEM ] Knet pMTU change: 1397
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397
Sep 15 17:41:08 m6kvm1 corosync[18145]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] connecting to client [16239]
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] connection created
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QUORUM] lib_init_fn: conn=0x556c2918d5f0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QUORUM] got quorum_type request on 0x556c2918d5f0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QUORUM] got trackstart request on 0x556c2918d5f0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QUORUM] sending initial status to 0x556c2918d5f0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] connecting to client [16239]
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] connection created
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CMAP  ] lib_init_fn: conn=0x556c2918ef20
Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name  m6kvm, version = 20)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] connecting to client [16239]
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] connection created
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] connecting to client [16239]
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] connection created
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] Creating commit token because I am the rep.
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] Saving state aru 5 high seq received 5
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Storing new sequence id for ring 1197
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] entering COMMIT state.
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] got commit token
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] entering RECOVERY state.
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] TRANS [0] member 1:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [0] member 1:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (1.1193)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 5 high delivered 5 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [1] member 2:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [2] member 3:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [3] member 4:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [4] member 5:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [5] member 6:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [6] member 7:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [7] member 8:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [8] member 9:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [9] member 10:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [10] member 11:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [11] member 12:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [12] member 13:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] position [13] member 14:
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] Did not need to originate any messages in recovery.
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] got commit token
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] Sending initial ORF token
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] Resetting old ring state
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] recovery to regular 1-0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] waiting_trans_ack changed to 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.90) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.91) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.92) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.93) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.94) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.95) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.96) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.97) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.107) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.108) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.109) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.110) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [MAIN  ] Member joined: r(0) ip(10.3.94.111) 
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [SYNC  ] call init for locally known services
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] entering OPERATIONAL state.
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [TOTEM ] A new membership (1.1197) was formed. Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [SYNC  ] enter sync process
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [SYNC  ] Committing synchronization for corosync configuration map access
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 2
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 3
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 4
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 5
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 6
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 7
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 8
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 9
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 10
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 11
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 12
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] got joinlist message from node 13
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] my downlist: members(old:1 left:0)
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [CPG   ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] Sending nodelist callback. ring_id = 1.1197
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 13
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] total_votes=2, expected_votes=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 13
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] total_votes=3, expected_votes=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] total_votes=3, expected_votes=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 2
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] total_votes=4, expected_votes=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 2 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 2
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm1 corosync[18145]:   [VOTEQ ] got nodeinfo message from cluster node 3
....
....
next corosync restart

Sep 15 17:42:03 m6kvm1 corosync[18145]:   [MAIN  ] Node was shut down by a signal
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [SERV  ] Unloading all Corosync service engines.
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [QB    ] withdrawing server sockets
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [QB    ] qb_ipcs_unref() - destroying
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [QB    ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2
Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [MAIN  ] cs_ipcs_connection_closed() 
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [CMAP  ] exit_fn for conn=0x556c2918ef20
Sep 15 17:42:03 m6kvm1 corosync[18145]:   [MAIN  ] cs_ipcs_connection_destroyed() 


node2
-----



Sep 15 17:41:05 m6kvm2 corosync[25411]:   [KNET  ] pmtud: Starting PMTUD for host: 10 link: 0
Sep 15 17:41:05 m6kvm2 corosync[25411]:   [KNET  ] udp: detected kernel MTU: 1500
Sep 15 17:41:05 m6kvm2 corosync[25411]:   [KNET  ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397
Sep 15 17:41:07 m6kvm2 corosync[25411]:   [KNET  ] rx: host: 1 link: 0 received pong: 2
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [TOTEM ] entering GATHER state from 11(merge during join).
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:08 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] pmtud: Starting PMTUD for host: 1 link: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] udp: detected kernel MTU: 1500
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [KNET  ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] got commit token
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] Saving state aru 123 high seq received 123
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [MAIN  ] Storing new sequence id for ring 1197
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] entering COMMIT state.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] got commit token
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] entering RECOVERY state.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [0] member 2:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [1] member 3:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [2] member 4:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [3] member 5:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [4] member 6:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [5] member 7:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [6] member 8:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [7] member 9:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [8] member 10:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [9] member 11:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [10] member 12:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [11] member 13:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] TRANS [12] member 14:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [0] member 1:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (1.1193)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 5 high delivered 5 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [1] member 2:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [2] member 3:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [3] member 4:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [4] member 5:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [5] member 6:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [6] member 7:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [7] member 8:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [8] member 9:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [9] member 10:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [10] member 11:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [11] member 12:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [12] member 13:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] position [13] member 14:
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] previous ringid (2.1192)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] aru 123 high delivered 123 received flag 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] Did not need to originate any messages in recovery.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] install seq 0 aru 0 high seq received 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] Resetting old ring state
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] recovery to regular 1-0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] waiting_trans_ack changed to 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [MAIN  ] Member joined: r(0) ip(10.3.94.89) 
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [SYNC  ] call init for locally known services
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] entering OPERATIONAL state.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] A new membership (1.1197) was formed. Members joined: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [SYNC  ] enter sync process
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [SYNC  ] Committing synchronization for corosync configuration map access
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CMAP  ] Not first sync -> no action
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 3
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 4
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 5
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 6
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 7
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 8
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 9
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 10
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 11
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 12
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] downlist left_list: 0 received
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got joinlist message from node 13
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] my downlist: members(old:13 left:0)
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] Sending nodelist callback. ring_id = 1.1197
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 13
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 13
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] total_votes=14, expected_votes=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 3 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 4 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 5 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 6 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 7 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 8 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 9 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 10 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 11 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 12 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 2 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] lowest node id: 1 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] highest node id: 14 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] total_votes=14, expected_votes=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 3 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 4 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 5 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 6 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 7 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 8 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 9 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 10 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 11 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 12 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 2 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] lowest node id: 1 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] highest node id: 14 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 3
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 4
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 4
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 5
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 5
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 6
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 6
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 7
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 7
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 8
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 8
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 9
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 9
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 10
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 10
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 11
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 11
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 12
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] got nodeinfo message from cluster node 12
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] total_votes=14, expected_votes=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 1 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 3 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 4 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 5 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 6 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 7 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 8 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 9 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 10 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 11 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 12 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 13 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 14 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] node 2 state=1, votes=1, expected=14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] lowest node id: 1 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] highest node id: 14 us: 2
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [QUORUM] sending quorum notification to (nil), length = 104
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [VOTEQ ] Sending quorum callback, quorate = 1
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [TOTEM ] waiting_trans_ack changed to 0
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239
Sep 15 17:41:09 m6kvm2 corosync[25411]:   [CPG   ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, September 15, 2020 16:57:46
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>I mean this is bad, but also great! 
>>Can you do a coredump of the whole thing and upload it somewhere with the version info 
>>used (for dbgsym package)? That could help a lot. 

I'll try to reproduce it again (with the full lock everywhere), and do the coredump. 




I have tried the real-time scheduling, 

but I have still been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think something gets unlocked at the next corosync restart). 


this time it was blocked at the same time on a node in: 

work { 
    ... 
    } elsif ($state eq 'active') { 
        .... 
        $self->update_lrm_status(); 


and another node in 

if ($fence_request) { 
    $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
    $self->set_local_status({ state => 'lost_agent_lock'}); 
} elsif (!$self->get_protected_ha_agent_lock()) { 
    $self->set_local_status({ state => 'lost_agent_lock'}); 
} elsif ($self->{mode} eq 'maintenance') { 
    $self->set_local_status({ state => 'maintenance'}); 
} 





----- Original Message ----- 
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
To: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Tuesday, September 15, 2020 16:32:52 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: 
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>> 
>>> # systemctl edit pve-cluster 
>>> 
>>> And then add snippet: 
>>> 
>>> 
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now 
> 
> 
>> I'm currently digging the logs 
>>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute 
> 
> 
> 
> After 1hour, it's still locked. 
> 
> on other nodes, I still have pmxfs logs like: 
> 

I mean this is bad, but also great! 
Can you do a coredump of the whole thing and upload it somewhere with the version info 
used (for dbgsym package)? That could help a lot. 


> manual "pmxcfs -d" 
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e 
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still 
running) but I'd rather say this is a side-effect not directly connected to the real bug. 

> 
> some interesting dmesg about "pvesr" 
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. 
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 
> [Tue Sep 15 14:45:34 2020] Call Trace: 
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> 

hmm, hangs in mkdir (cluster wide locking) 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-15 15:58                                                                             ` Alexandre DERUMIER
@ 2020-09-16  7:34                                                                               ` Alexandre DERUMIER
  2020-09-16  7:58                                                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16  7:34 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Hi,

I have reproduced the problem now,

and this time I have not restarted corosync a second time after /etc/pve locked up,

so currently it's still read-only.


I haven't used gdb in a long time, 

could you tell me how to attach to the running pmxcfs process, and which gdb commands to run?
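
From what I remember it should be something like the following, but please correct me (rough sketch only; I'm assuming the matching dbgsym package is installed so the symbols resolve):

gdb -p $(pidof pmxcfs)        # attach to the running pmxcfs process

then, inside gdb:

  info threads                (list all threads)
  thread apply all bt         (backtrace of every thread)
  generate-core-file          (write a core dump to the current directory)
  detach                      (leave pmxcfs running)
  quit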

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Sent: Tuesday, September 15, 2020 17:58:33
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Another small lock at 17:41:09 

To be sure, I have run a small loop writing to /etc/pve each second, on node2. 

it hangs at the first corosync restart, then works again after the second corosync restart. 

I'll try to improve this tomorrow to be able to debug the corosync process (rough sketch below): 
- restart corosync 
- do some writes in /etc/pve/ 
- and if it's hanging, don't restart corosync again 
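
Something along these lines (a sketch only, the timings and paths are just examples):

#!/bin/bash
# restart corosync once a minute and write to /etc/pve in between;
# stop restarting as soon as a write hangs, so the hung state can be inspected
while true; do
    systemctl restart corosync
    sleep 5
    if ! timeout 30 sh -c 'echo test > /etc/pve/test'; then
        echo "write to /etc/pve hung - leaving corosync alone for debugging"
        break
    fi
    sleep 55
done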



node2: echo test > /etc/pve/test loop 
-------------------------------------- 
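The loop is roughly the following (sketch of what runs on node2, output below):

  while sleep 1; do
      date '+Current time : %H:%M:%S'
      echo test > /etc/pve/test
  done
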
Current time : 17:41:01 
Current time : 17:41:02 
Current time : 17:41:03 
Current time : 17:41:04 
Current time : 17:41:05 
Current time : 17:41:06 
Current time : 17:41:07 
Current time : 17:41:08 
Current time : 17:41:09 

hang 

Current time : 17:42:05 
Current time : 17:42:06 
Current time : 17:42:07 



node1 
----- 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 
Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 
.... 
.... 
next corosync restart 

Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 
Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() 


node2 
----- 



Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 

----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 15 Septembre 2020 16:57:46 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>I mean this is bad, but also great! 
>>Can you do a coredump of the whole thing and upload it somewhere with the version info 
>>used (for dbgsym package)? That could help a lot. 

I'll try to reproduce it again (with the full lock everywhere) and take the coredump. 
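
My rough plan for capturing it (just a sketch, assuming gcore from the gdb package is available on the node): 

pid=$(pidof pmxcfs)                        # pid of the hung pmxcfs 
gcore -o /var/tmp/pmxcfs.core "$pid"       # dump a core without killing the process 
pveversion -v > /var/tmp/pmxcfs.versions   # record the package versions for the dbgsym lookup 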




I have tried the real-time scheduling, 

but I have still been able to reproduce the "lrm too long" for 60s (though, as I'm restarting corosync each minute, I think something gets unlocked again at the next corosync restart). 


this time it was blocked at the same time; on one node in: 

work { 
    ... 
    } elsif ($state eq 'active') { 
        .... 
        $self->update_lrm_status(); 


and on another node in: 

    if ($fence_request) { 
        $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
        $self->set_local_status({ state => 'lost_agent_lock'}); 
    } elsif (!$self->get_protected_ha_agent_lock()) { 
        $self->set_local_status({ state => 'lost_agent_lock'}); 
    } elsif ($self->{mode} eq 'maintenance') { 
        $self->set_local_status({ state => 'maintenance'}); 
    } 





----- Mail original ----- 
De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
À: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 15 Septembre 2020 16:32:52 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: 
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>> 
>>> # systemctl edit pve-cluster 
>>> 
>>> And then add snippet: 
>>> 
>>> 
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now 
> 
> 
>> I'm currently digging the logs 
>>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute 
> 
> 
> 
> After 1hour, it's still locked. 
> 
> on other nodes, I still have pmxfs logs like: 
> 

I mean this is bad, but also great! 
Can you do a coredump of the whole thing and upload it somewhere with the version info 
used (for dbgsym package)? That could help a lot. 


> manual "pmxcfs -d" 
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e 
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still 
running) but I'd rather say this is a side-effect not directly connected to the real bug. 

> 
> some interesting dmesg about "pvesr" 
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. 
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 
> [Tue Sep 15 14:45:34 2020] Call Trace: 
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> 

hmm, hangs in mkdir (cluster wide locking) 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 






^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16  7:34                                                                               ` Alexandre DERUMIER
@ 2020-09-16  7:58                                                                                 ` Alexandre DERUMIER
  2020-09-16  8:30                                                                                   ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16  7:58 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

here is a backtrace with pve-cluster-dbgsym installed:
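
I attached to the running process roughly like this (from memory, so the exact invocation is approximate):

gdb -p "$(pidof pmxcfs)"    # attach to the hung pmxcfs; pve-cluster-dbgsym provides the symbols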


(gdb) bt full
#0  0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#2  0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055
        ret = -1
        lockfd = <optimized out>
        pipefd = {8, 9}
        foreground = 0
        force_local_mode = 0
        wrote_pidfile = 1
        memdb = 0x5617f7c563b0
        dcdb = 0x5617f8046ca0
        status_fsm = 0x5617f806a630
        context = <optimized out>
        entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, {
            long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, {
            long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", 
            arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}}
        err = 0x0
        __func__ = "main"
        utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, 
          version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>}
        dot = <optimized out>
        www_data = <optimized out>
        create = <optimized out>
        conf_data = 0x5617f80466a0
        len = <optimized out>
        config = <optimized out>
        bplug = <optimized out>
        fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0}
        fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1}
        fuse_chan = 0x5617f7c560c0
        corosync_loop = 0x5617f80481d0
        service_quorum = 0x5617f8048460
        service_confdb = 0x5617f8069da0
        service_dcdb = 0x5617f806a5d0
        service_status = 0x5617f806a8e0


----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Envoyé: Mercredi 16 Septembre 2020 09:34:27
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi, 

I have reproduced the problem now, 

and this time I didn't restart corosync a second time after the lock of /etc/pve, 

so currently it's read-only. 


I haven't used gdb in a long time, 

could you tell me how to attach to the running pmxcfs process, and which gdb commands to run? 

----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Envoyé: Mardi 15 Septembre 2020 17:58:33 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

Another small lock at 17:41:09 

To be sure, I ran a small loop on node2 that writes to /etc/pve every second. 
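
The loop is basically this (I didn't keep the exact one-liner, so this is the assumed form): 

while true; do 
    echo "Current time : $(date +%H:%M:%S)"   # the timestamps shown below 
    echo test > /etc/pve/test                 # blocks as soon as pmxcfs stops answering 
    sleep 1 
done 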

It hangs at the first corosync restart, then starts working again after the second corosync restart. 

I'll try to improve this tomorrow to be able to debug the corosync process: 
- restart corosync 
- do some writes in /etc/pve/ 
- and if it's hanging, don't restart corosync again 



node2: echo test > /etc/pve/test loop 
-------------------------------------- 
Current time : 17:41:01 
Current time : 17:41:02 
Current time : 17:41:03 
Current time : 17:41:04 
Current time : 17:41:05 
Current time : 17:41:06 
Current time : 17:41:07 
Current time : 17:41:08 
Current time : 17:41:09 

hang 

Current time : 17:42:05 
Current time : 17:42:06 
Current time : 17:42:07 



node1 
----- 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 
Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 
.... 
.... 
next corosync restart 

Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 
Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() 


node2 
----- 



Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 

----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 15 Septembre 2020 16:57:46 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>I mean this is bad, but also great! 
>>Can you do a coredump of the whole thing and upload it somewhere with the version info 
>>used (for dbgsym package)? That could help a lot. 

I'll try to reproduce it again (with the full lock everywhere) and do the coredump. 
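
For reference, a minimal sketch of how I'd collect such a coredump from the running pmxcfs without killing it (assuming gdb is installed and a pve-cluster dbgsym package is available; package name and paths here are illustrative, not confirmed): 

# install gdb plus debug symbols for pmxcfs (libqb/fuse dbgsym would help too) 
apt install gdb pve-cluster-dbgsym 

# gcore ships with gdb and dumps a core of the live process 
gcore -o /tmp/pmxcfs.core "$(pidof pmxcfs)" 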




I have tried the real-time scheduling, but I was still able to reproduce the "lrm too long" for 60s (though as I'm restarting corosync each minute, I think something gets unlocked again at the next corosync restart). 
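
(For completeness, the reproducer is nothing more than a loop like the one below on a single node; the exact sleep interval is the only assumption:) 

# restart corosync on one node once per minute until the cluster locks up 
while true; do 
    systemctl restart corosync 
    sleep 60 
done 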


This time it was blocked at the same moment; one node was stuck in: 

work { 
    ... 
    } elsif ($state eq 'active') { 
        ... 
        $self->update_lrm_status(); 

and another node was stuck in: 

    if ($fence_request) { 
        $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
        $self->set_local_status({ state => 'lost_agent_lock'}); 
    } elsif (!$self->get_protected_ha_agent_lock()) { 
        $self->set_local_status({ state => 'lost_agent_lock'}); 
    } elsif ($self->{mode} eq 'maintenance') { 
        $self->set_local_status({ state => 'maintenance'}); 
    } 
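
(A quick, non-destructive way to check whether /etc/pve is blocked for writes on a node is a timed write on the fuse mount, along the lines of the snippet below; the file name is just illustrative — the backtraces further down show exactly such a test file create hanging:) 

# a hung pmxcfs makes any write on the fuse mount block; timeout returns 124 instead of hanging the shell 
timeout 5 touch /etc/pve/testfile && echo "write ok" || echo "write blocked or failed" 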





----- Mail original ----- 
De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
À: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 15 Septembre 2020 16:32:52 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: 
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>> 
>>> # systemctl edit pve-cluster 
>>> 
>>> And then add snippet: 
>>> 
>>> 
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now 
> 
> 
>> I'm currently digging the logs 
>>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute 
> 
> 
> 
> After 1hour, it's still locked. 
> 
> on other nodes, I still have pmxfs logs like: 
> 

I mean this is bad, but also great! 
Can you do a coredump of the whole thing and upload it somewhere with the version info 
used (for dbgsym package)? That could help a lot. 


> manual "pmxcfs -d" 
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e 
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still 
running) but I'd rather say this is a side-effect not directly connected to the real bug. 

> 
> some interesting dmesg about "pvesr" 
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. 
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 
> [Tue Sep 15 14:45:34 2020] Call Trace: 
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> 

hmm, hangs in mkdir (cluster wide locking) 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16  7:58                                                                                 ` Alexandre DERUMIER
@ 2020-09-16  8:30                                                                                   ` Alexandre DERUMIER
  2020-09-16  8:53                                                                                     ` Alexandre DERUMIER
       [not found]                                                                                     ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com>
  0 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16  8:30 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Backtrace of all threads (I don't have the libqb or fuse debug symbols).
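
(The output below was gathered by attaching gdb to the running pmxcfs and walking the threads manually; roughly the same information can be dumped in one go with something like:)

# attach to the running daemon and dump a full backtrace of every thread
gdb -p "$(pidof pmxcfs)" -batch -ex "thread apply all bt full" > /tmp/pmxcfs-bt.txt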


(gdb) info threads
  Id   Target Id                                    Frame 
* 1    Thread 0x7fce63d5f900 (LWP 16239) "pmxcfs"   0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7fce63ce6700 (LWP 16240) "cfs_loop" 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  3    Thread 0x7fce618c9700 (LWP 16246) "server"   0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  4    Thread 0x7fce46beb700 (LWP 30256) "pmxcfs"   0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  5    Thread 0x7fce2abf9700 (LWP 43943) "pmxcfs"   0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  6    Thread 0x7fce610a6700 (LWP 13346) "pmxcfs"   0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  7    Thread 0x7fce2a3f8700 (LWP 8832) "pmxcfs"    0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7fce28bf5700 (LWP 3464) "pmxcfs"    0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  9    Thread 0x7fce453e8700 (LWP 3727) "pmxcfs"    0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7fce2bfff700 (LWP 6705) "pmxcfs"    0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7fce293f6700 (LWP 41454) "pmxcfs"   0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  12   Thread 0x7fce45be9700 (LWP 17734) "pmxcfs"   0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  13   Thread 0x7fce2b7fe700 (LWP 17762) "pmxcfs"   0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  14   Thread 0x7fce463ea700 (LWP 2347) "pmxcfs"    0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  15   Thread 0x7fce44be7700 (LWP 11335) "pmxcfs"   0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0


(gdb) thread 1
[Switching to thread 1 (Thread 0x7fce63d5f900 (LWP 16239))]
#0  0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#2  0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055
        ret = -1
        lockfd = <optimized out>
        pipefd = {8, 9}
        foreground = 0
        force_local_mode = 0
        wrote_pidfile = 1
        memdb = 0x5617f7c563b0
        dcdb = 0x5617f8046ca0
        status_fsm = 0x5617f806a630
        context = <optimized out>
        entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, {
            long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, {
            long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", 
            arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}}
        err = 0x0
        __func__ = "main"
        utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, 
          version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>}
        dot = <optimized out>
        www_data = <optimized out>
        create = <optimized out>
        conf_data = 0x5617f80466a0
        len = <optimized out>
        config = <optimized out>
        bplug = <optimized out>
        fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0}
        fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1}
        fuse_chan = 0x5617f7c560c0
        corosync_loop = 0x5617f80481d0
        service_quorum = 0x5617f8048460
        service_confdb = 0x5617f8069da0
        service_dcdb = 0x5617f806a5d0
        service_status = 0x5617f806a8e0



(gdb) thread 2
[Switching to thread 2 (Thread 0x7fce63ce6700 (LWP 16240))]
#0  0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fce67500b2a in ?? () from /lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#2  0x00007fce674f1bb0 in qb_loop_run () from /lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#3  0x00005617f6d5cd31 in cfs_loop_worker_thread (data=0x5617f80481d0) at loop.c:330
        __func__ = "cfs_loop_worker_thread"
        loop = 0x5617f80481d0
        qbloop = 0x5617f8048230
        l = <optimized out>
        ctime = <optimized out>
        th = 7222815479134420993
#4  0x00007fce67968415 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#5  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#6  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) bt full
#0  0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fce67500b2a in ?? () from /lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#2  0x00007fce674f1bb0 in qb_loop_run () from /lib/x86_64-linux-gnu/libqb.so.0
No symbol table info available.
#3  0x00005617f6d5e453 in worker_thread (data=<optimized out>) at server.c:529
        th = 5863405772435619840
        __func__ = "worker_thread"
        _g_boolean_var_ = <optimized out>
#4  0x00007fce67968415 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#5  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#6  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) bt full
#0  0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#2  0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=5, iov=iov@entry=0x7fce46bea870, len=len@entry=8, rp=rp@entry=0x7fce46bea860) at dfsm.c:339
        __func__ = "dfsm_send_message_sync"
        msgcount = 189465
        header = {base = {type = 0, subtype = 5, protocol_version = 1, time = 1600241342, reserved = 0}, count = 189465}
        real_iov = {{iov_base = 0x7fce46bea7e0, iov_len = 24}, {iov_base = 0x7fce46bea84c, iov_len = 4}, {iov_base = 0x7fce46bea940, iov_len = 4}, {iov_base = 0x7fce46bea858, iov_len = 4}, {iov_base = 0x7fce46bea85c, iov_len = 4}, {
            iov_base = 0x7fce46bea948, iov_len = 4}, {iov_base = 0x7fce1c002980, iov_len = 34}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        result = CS_OK
#3  0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_CREATE, path=0x7fce1c002980 "nodes/m6kvm1/lrm_status.tmp.42080", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, 
    size@entry=0, offset=<optimized out>, flags=<optimized out>) at dcdb.c:157
        iov = {{iov_base = 0x7fce46bea84c, iov_len = 4}, {iov_base = 0x7fce46bea940, iov_len = 4}, {iov_base = 0x7fce46bea858, iov_len = 4}, {iov_base = 0x7fce46bea85c, iov_len = 4}, {iov_base = 0x7fce46bea948, iov_len = 4}, {
            iov_base = 0x7fce1c002980, iov_len = 34}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        pathlen = 34
        tolen = 0
        rc = {msgcount = 189465, result = -16, processed = 0}
#4  0x00005617f6d6b640 in cfs_plug_memdb_create (mode=<optimized out>, fi=<optimized out>, path=<optimized out>, plug=0x5617f8048b70) at cfs-plug-memdb.c:307
        res = <optimized out>
        mdb = 0x5617f8048b70
        __func__ = "cfs_plug_memdb_create"
        res = <optimized out>
        mdb = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        ctime = <optimized out>
#5  cfs_plug_memdb_create (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>, fi=<optimized out>) at cfs-plug-memdb.c:291
        res = <optimized out>
        mdb = <optimized out>
        __func__ = "cfs_plug_memdb_create"
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        ctime = <optimized out>
#6  0x00005617f6d5ca5d in cfs_fuse_create (path=0x7fce1c000bf0 "/nodes/m6kvm1/lrm_status.tmp.42080", mode=33188, fi=0x7fce46beab30) at pmxcfs.c:415
        __func__ = "cfs_fuse_create"
        ret = -13
        subpath = 0x7fce1c002980 "nodes/m6kvm1/lrm_status.tmp.42080"
        plug = <optimized out>
#7  0x00007fce678c152b in fuse_fs_create () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#8  0x00007fce678c165a in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#9  0x00007fce678c7bb7 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#10 0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#11 0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#12 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#13 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) thread 5
[Switching to thread 5 (Thread 0x7fce2abf9700 (LWP 43943))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 6 
[Switching to thread 6 (Thread 0x7fce610a6700 (LWP 13346))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) thread 7
[Switching to thread 7 (Thread 0x7fce2a3f8700 (LWP 8832))]
#0  0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#2  0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=5, iov=iov@entry=0x7fce2a3f7870, len=len@entry=8, rp=rp@entry=0x7fce2a3f7860) at dfsm.c:339
        __func__ = "dfsm_send_message_sync"
        msgcount = 189466
        header = {base = {type = 0, subtype = 5, protocol_version = 1, time = 1600242491, reserved = 0}, count = 189466}
        real_iov = {{iov_base = 0x7fce2a3f77e0, iov_len = 24}, {iov_base = 0x7fce2a3f784c, iov_len = 4}, {iov_base = 0x7fce2a3f7940, iov_len = 4}, {iov_base = 0x7fce2a3f7858, iov_len = 4}, {iov_base = 0x7fce2a3f785c, iov_len = 4}, {
            iov_base = 0x7fce2a3f7948, iov_len = 4}, {iov_base = 0x7fce1406f350, iov_len = 10}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        result = CS_OK
#3  0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_CREATE, path=0x7fce1406f350 "testalex2", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, size@entry=0, 
    offset=<optimized out>, flags=<optimized out>) at dcdb.c:157
        iov = {{iov_base = 0x7fce2a3f784c, iov_len = 4}, {iov_base = 0x7fce2a3f7940, iov_len = 4}, {iov_base = 0x7fce2a3f7858, iov_len = 4}, {iov_base = 0x7fce2a3f785c, iov_len = 4}, {iov_base = 0x7fce2a3f7948, iov_len = 4}, {
            iov_base = 0x7fce1406f350, iov_len = 10}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        pathlen = 10
        tolen = 0
        rc = {msgcount = 189466, result = -16, processed = 0}
#4  0x00005617f6d6b640 in cfs_plug_memdb_create (mode=<optimized out>, fi=<optimized out>, path=<optimized out>, plug=0x5617f8048b70) at cfs-plug-memdb.c:307
        res = <optimized out>
        mdb = 0x5617f8048b70
        __func__ = "cfs_plug_memdb_create"
        res = <optimized out>
        mdb = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        ctime = <optimized out>
#5  cfs_plug_memdb_create (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>, fi=<optimized out>) at cfs-plug-memdb.c:291
        res = <optimized out>
        mdb = <optimized out>
        __func__ = "cfs_plug_memdb_create"
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        _g_boolean_var_ = <optimized out>
        ctime = <optimized out>
#6  0x00005617f6d5ca5d in cfs_fuse_create (path=0x7fce14022ff0 "/testalex2", mode=33188, fi=0x7fce2a3f7b30) at pmxcfs.c:415
        __func__ = "cfs_fuse_create"
        ret = -13
        subpath = 0x7fce1406f350 "testalex2"
        plug = <optimized out>
#7  0x00007fce678c152b in fuse_fs_create () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#8  0x00007fce678c165a in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#9  0x00007fce678c7bb7 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#10 0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#11 0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#12 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#13 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 8
[Switching to thread 8 (Thread 0x7fce28bf5700 (LWP 3464))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 9
[Switching to thread 9 (Thread 0x7fce453e8700 (LWP 3727))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 10
[Switching to thread 10 (Thread 0x7fce2bfff700 (LWP 6705))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) thread 11
[Switching to thread 11 (Thread 0x7fce293f6700 (LWP 41454))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) thread 12
[Switching to thread 12 (Thread 0x7fce45be9700 (LWP 17734))]
#0  0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#2  0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=2, iov=iov@entry=0x7fce45be88e0, len=len@entry=8, rp=rp@entry=0x7fce45be88d0) at dfsm.c:339
        __func__ = "dfsm_send_message_sync"
        msgcount = 189464
        header = {base = {type = 0, subtype = 2, protocol_version = 1, time = 1600241340, reserved = 0}, count = 189464}
        real_iov = {{iov_base = 0x7fce45be8850, iov_len = 24}, {iov_base = 0x7fce45be88bc, iov_len = 4}, {iov_base = 0x7fce45be89b0, iov_len = 4}, {iov_base = 0x7fce45be88c8, iov_len = 4}, {iov_base = 0x7fce45be88cc, iov_len = 4}, {
            iov_base = 0x7fce45be89b8, iov_len = 4}, {iov_base = 0x7fce34001dd0, iov_len = 31}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        result = CS_OK
#3  0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_MKDIR, path=0x7fce34001dd0 "priv/lock/file-replication_cfg", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, 
    size@entry=0, offset=<optimized out>, flags=<optimized out>) at dcdb.c:157
        iov = {{iov_base = 0x7fce45be88bc, iov_len = 4}, {iov_base = 0x7fce45be89b0, iov_len = 4}, {iov_base = 0x7fce45be88c8, iov_len = 4}, {iov_base = 0x7fce45be88cc, iov_len = 4}, {iov_base = 0x7fce45be89b8, iov_len = 4}, {
            iov_base = 0x7fce34001dd0, iov_len = 31}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}}
        pathlen = 31
        tolen = 0
        rc = {msgcount = 189464, result = -16, processed = 0}
#4  0x00005617f6d6b0e7 in cfs_plug_memdb_mkdir (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>) at cfs-plug-memdb.c:329
        __func__ = "cfs_plug_memdb_mkdir"
        res = <optimized out>
        mdb = 0x5617f8048b70
#5  0x00005617f6d5bc70 in cfs_fuse_mkdir (path=0x7fce34002240 "/priv/lock/file-replication_cfg", mode=493) at pmxcfs.c:238
        __func__ = "cfs_fuse_mkdir"
        ret = -13
        subpath = 0x7fce34001dd0 "priv/lock/file-replication_cfg"
        plug = <optimized out>
#6  0x00007fce678c2ccb in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#7  0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#8  0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#9  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#10 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

(gdb) thread 13
[Switching to thread 13 (Thread 0x7fce2b7fe700 (LWP 17762))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 14
[Switching to thread 14 (Thread 0x7fce463ea700 (LWP 2347))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.


(gdb) thread 15
[Switching to thread 15 (Thread 0x7fce44be7700 (LWP 11335))]
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt full
#0  0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#2  0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#3  0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
No symbol table info available.
#4  0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
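
The interesting part above: threads 7 and 12 are both stuck in dfsm_send_message_sync() at dfsm.c:339, blocked in g_cond_wait() while waiting for confirmation of a CPG message (a CFS_CREATE for "testalex2" and a CFS_MKDIR for "priv/lock/file-replication_cfg") that apparently never arrives, so those FUSE requests on /etc/pve never complete. As a minimal, self-contained illustration of that wait pattern (plain glib, heavily simplified, not the actual pmxcfs code):

#include <glib.h>
#include <stdio.h>

static GMutex mutex;
static GCond  cond;
static gboolean processed = FALSE;

/* Stand-in for the CPG dispatch side that normally confirms the message.
   Here it answers after 2 seconds; in the observed hang it never does. */
static gpointer confirm_later(gpointer data)
{
    g_usleep(2 * G_USEC_PER_SEC);
    g_mutex_lock(&mutex);
    processed = TRUE;
    g_cond_signal(&cond);
    g_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    GThread *t = g_thread_new("confirm", confirm_later, NULL);

    /* Same blocking pattern as in the backtraces: wait until the dispatch
       side marks the message as processed. */
    g_mutex_lock(&mutex);
    while (!processed)
        g_cond_wait(&cond, &mutex);
    g_mutex_unlock(&mutex);

    g_thread_join(t);
    printf("message confirmed\n");
    return 0;
}

In the real daemon the signalling side is the CPG/dfsm dispatch path; if cluster synchronization never completes, the flag is never set and the worker stays blocked exactly as shown in threads 7 and 12.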

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Sent: Wednesday, September 16, 2020 09:58:02
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Here is a backtrace with pve-cluster-dbgsym installed 


(gdb) bt full 
#0 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 
No symbol table info available. 
#1 0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0 
No symbol table info available. 
#2 0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 
No symbol table info available. 
#3 0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 
No symbol table info available. 
#4 0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055 
ret = -1 
lockfd = <optimized out> 
pipefd = {8, 9} 
foreground = 0 
force_local_mode = 0 
wrote_pidfile = 1 
memdb = 0x5617f7c563b0 
dcdb = 0x5617f8046ca0 
status_fsm = 0x5617f806a630 
context = <optimized out> 
entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, { 
long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, { 
long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", 
arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}} 
err = 0x0 
__func__ = "main" 
utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, 
version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>} 
dot = <optimized out> 
www_data = <optimized out> 
create = <optimized out> 
conf_data = 0x5617f80466a0 
len = <optimized out> 
config = <optimized out> 
bplug = <optimized out> 
fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0} 
fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1} 
fuse_chan = 0x5617f7c560c0 
corosync_loop = 0x5617f80481d0 
service_quorum = 0x5617f8048460 
service_confdb = 0x5617f8069da0 
service_dcdb = 0x5617f806a5d0 
service_status = 0x5617f806a8e0 


----- Original Message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Sent: Wednesday, September 16, 2020 09:34:27 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

Hi, 

I have reproduced the problem now, 

and this time I did not restart corosync a second time after /etc/pve locked up, 

so it is currently read-only. 


I have not used gdb in a long time; 

could you tell me how to attach to the running pmxcfs process, and which gdb commands to run? 
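
(A typical way to do this with standard gdb, assuming pve-cluster-dbgsym is installed for symbols, would be roughly: 

gdb -p $(pidof pmxcfs) 
(gdb) set pagination off 
(gdb) thread apply all bt full 
(gdb) detach 
(gdb) quit 

treat this as a sketch rather than the exact invocation.) 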

----- Original Message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Sent: Tuesday, September 15, 2020 17:58:33 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

Another small lock at 17:41:09 

To be sure, I ran a small loop writing to /etc/pve every second on node2 (sketch below). 
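
Something along these lines (a reconstruction, the exact script is not in the thread): 

while true; do 
    echo "Current time : $(date +%H:%M:%S)" 
    echo test > /etc/pve/test 
    sleep 1 
done 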

It hangs at the first corosync restart; then, on the second corosync restart, it works again. 

I'll try to improve this tomorrow so that the corosync process can be debugged (see the sketch after this list): 
- restart corosync 
- do some writes in /etc/pve/ 
- and if it's hanging, don't restart corosync again 
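
A rough version of that procedure (hypothetical, just to avoid restarting corosync by hand when the write hangs): 

systemctl restart corosync 
sleep 5 
if timeout 10 bash -c 'echo test > /etc/pve/test'; then 
    echo "write ok, safe to restart corosync again" 
else 
    echo "write hangs, leaving corosync alone for debugging" 
fi 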



node2: echo test > /etc/pve/test loop 
-------------------------------------- 
Current time : 17:41:01 
Current time : 17:41:02 
Current time : 17:41:03 
Current time : 17:41:04 
Current time : 17:41:05 
Current time : 17:41:06 
Current time : 17:41:07 
Current time : 17:41:08 
Current time : 17:41:09 

hang 

Current time : 17:42:05 
Current time : 17:42:06 
Current time : 17:42:07 



node1 
----- 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 
Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 
.... 
.... 
next corosync restart 

Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 
Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 
Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() 


node2 
----- 



Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 
Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [11] member 12: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. ring_id = 1.1197 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 

----- Original Message ----- 
From: "aderumier" <aderumier@odiso.com> 
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Tuesday, 15 September 2020 16:57:46 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>I mean this is bad, but also great! 
>>Cam you do a coredump of the whole thing and upload it somewhere with the version info 
>>used (for dbgsym package)? That could help a lot. 

I'll try to reproduce it again (with the full lock everywhere), and do the coredump. 




I have tried the real-time scheduling, 

but I have still been able to reproduce the "lrm too long" hang for 60s (though as I'm restarting corosync each minute, I think something gets unlocked again 
at the next corosync restart). 


this time, it was blocked at the same time on one node in: 

work { 
    ... 
    } elsif ($state eq 'active') { 
        .... 
        $self->update_lrm_status(); 


and on another node in 

if ($fence_request) { 
    $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
    $self->set_local_status({ state => 'lost_agent_lock'}); 
} elsif (!$self->get_protected_ha_agent_lock()) { 
    $self->set_local_status({ state => 'lost_agent_lock'}); 
} elsif ($self->{mode} eq 'maintenance') { 
    $self->set_local_status({ state => 'maintenance'}); 
} 





----- Original Message ----- 
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
To: "aderumier" <aderumier@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Tuesday, 15 September 2020 16:32:52 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: 
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>>> 
>>> # systemctl edit pve-cluster 
>>> 
>>> And then add snippet: 
>>> 
>>> 
>>> [Service] 
>>> CPUSchedulingPolicy=rr 
>>> CPUSchedulingPriority=99 
> yes, sure, I'll do it now 
> 
> 
>> I'm currently digging the logs 
>>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? 
> yes, a simple "systemctl restart corosync" on 1 node each minute 
> 
> 
> 
> After 1hour, it's still locked. 
> 
> on other nodes, I still have pmxfs logs like: 
> 

I mean this is bad, but also great! 
Can you do a coredump of the whole thing and upload it somewhere with the version info 
used (for dbgsym package)? That could help a lot. 


> manual "pmxcfs -d" 
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e 
> 

Hmm, the fuse connection of the previous one got into a weird state (or something is still 
running), but I'd rather say this is a side effect, not directly connected to the real bug. 

> 
> some interesting dmesg about "pvesr" 
> 
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. 
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 
> [Tue Sep 15 14:45:34 2020] Call Trace: 
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 
> 

hmm, it hangs in mkdir (cluster-wide locking) 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 








^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16  8:30                                                                                   ` Alexandre DERUMIER
@ 2020-09-16  8:53                                                                                     ` Alexandre DERUMIER
       [not found]                                                                                     ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com>
  1 sibling, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16  8:53 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

my last mail was too long, so here are the pmxcfs backtraces as gists:



node1-bt  -> the node where corosync is restarted each minute
-------- 
https://gist.github.com/aderumier/ed21f22aa6ed9099ec0199255112f6b6 

node2-bt  -> another node that is hanging too
-------- 
https://gist.github.com/aderumier/31fb72b93e77a93fbaec975bc54dfb3a




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
       [not found]                                                                                     ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com>
@ 2020-09-16 13:15                                                                                       ` Alexandre DERUMIER
  2020-09-16 14:45                                                                                         ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16 13:15 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

I have reproduced it again, with pmxcfs in debug mode.

corosync was restarted at 15:02:10, and it was already blocked on other nodes at 15:02:12.

pmxcfs was still logging after the lock.


here is the log on node1, where corosync was restarted:

http://odisoweb1.odiso.net/pmxcfs-corosync.log



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16 13:15                                                                                       ` Alexandre DERUMIER
@ 2020-09-16 14:45                                                                                         ` Thomas Lamprecht
  2020-09-16 15:17                                                                                           ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-16 14:45 UTC (permalink / raw)
  To: Alexandre DERUMIER, Proxmox VE development discussion

On 9/16/20 3:15 PM, Alexandre DERUMIER wrote:
> I have reproduce it again, with pmxcfs in debug mode
> 
> corosync restart at 15:02:10, and it was already block on other nodes at 15:02:12
> 
> The pmxcfs was still logging after the lock.
> 
> 
> here the log on node1 where corosync has been restarted
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync.log
> 


thanks for those, I need a bit to sift through them. Seems like either dfsm gets
out of sync or we do not get an ACK reply from cpg_send.

A full core dump would still be nice; in gdb:
generate-core-file

PS: instead of manually switching to threads you can do:
thread apply all bt full

to get a backtrace for all threads in one command




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16 14:45                                                                                         ` Thomas Lamprecht
@ 2020-09-16 15:17                                                                                           ` Alexandre DERUMIER
  2020-09-17  9:21                                                                                             ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-16 15:17 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

I have produced it again, with the coredump this time.


restart corosync : 17:05:27

http://odisoweb1.odiso.net/pmxcfs-corosync2.log


bt full

https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b


coredump


http://odisoweb1.odiso.net/core.7761.gz



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Wednesday, 16 September 2020 16:45:12
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/16/20 3:15 PM, Alexandre DERUMIER wrote: 
> I have reproduce it again, with pmxcfs in debug mode 
> 
> corosync restart at 15:02:10, and it was already block on other nodes at 15:02:12 
> 
> The pmxcfs was still logging after the lock. 
> 
> 
> here the log on node1 where corosync has been restarted 
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync.log 
> 


thanks for those, I need a bit to sift through them. Seem like either dfsm gets 
out of sync or we do not get a ACK reply from cpg_send. 

A full core dump would be still nice, in gdb: 
generate-core-file 

PS: instead of manually switching to threads you can do: 
thread apply all bt full 

to get a backtrace for all threads in one command 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-16 15:17                                                                                           ` Alexandre DERUMIER
@ 2020-09-17  9:21                                                                                             ` Fabian Grünbichler
  2020-09-17  9:59                                                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-17  9:21 UTC (permalink / raw)
  To: Proxmox VE development discussion, Thomas Lamprecht

On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote:
> I have produce it again, with the coredump this time
> 
> 
> restart corosync : 17:05:27
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync2.log
> 
> 
> bt full
> 
> https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b
> 
> 
> coredump
> 
> 
> http://odisoweb1.odiso.net/core.7761.gz

just a short update on this:

dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg 
(yet). this looks like either the START_SYNC message to node 13, or the 
STATE response from it got lost or processed wrong. until the mode
switches to SYNCED (after all states have been received and the state 
update went through), regular/normal messages can be sent, but the 
incoming normal messages are queued and not processed. this is why the 
fuse access blocks: it sends the request out, but the response ends up 
in the queue.

status (the other thing running on top of dfsm) got correctly synced up 
at the same time, so it's either a dcdb specific bug, or just bad luck 
that one was affected and the other wasn't.

unfortunately even with debug enabled the logs don't contain much 
information that would help (e.g., we don't log sending/receiving STATE 
messages except when they look 'wrong'), so Thomas is trying to 
reproduce this using your scenario here to improve turn around time. if 
we can't reproduce it, we'll have to send you patches/patched debs with 
increased logging to narrow down what is going on. if we can, then we 
can hopefully find and fix the issue fast.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-17  9:21                                                                                             ` Fabian Grünbichler
@ 2020-09-17  9:59                                                                                               ` Alexandre DERUMIER
  2020-09-17 10:02                                                                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-17  9:59 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

Thanks for the update.

>> if
>>we can't reproduce it, we'll have to send you patches/patched debs with
>>increased logging to narrow down what is going on. if we can, than we
>>can hopefully find and fix the issue fast.

No problem, I can install the patched deb if needed.



----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Sent: Thursday, 17 September 2020 11:21:45
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote: 
> I have produce it again, with the coredump this time 
> 
> 
> restart corosync : 17:05:27 
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync2.log 
> 
> 
> bt full 
> 
> https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b 
> 
> 
> coredump 
> 
> 
> http://odisoweb1.odiso.net/core.7761.gz 

just a short update on this: 

dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg 
(yet). this looks like either the START_SYNC message to node 13, or the 
STATE response from it got lost or processed wrong. until the mode 
switches to SYNCED (after all states have been received and the state 
update went through), regular/normal messages can be sent, but the 
incoming normal messages are queued and not processed. this is why the 
fuse access blocks, it sends the request out, but the response ends up 
in the queue. 

status (the other thing running on top of dfsm) got correctly synced up 
at the same time, so it's either a dcdb specific bug, or just bad luck 
that one was affected and the other wasn't. 

unfortunately even with debug enabled the logs don't contain much 
information that would help (e.g., we don't log sending/receiving STATE 
messages except when they look 'wrong'), so Thomas is trying to 
reproduce this using your scenario here to improve turn around time. if 
we can't reproduce it, we'll have to send you patches/patched debs with 
increased logging to narrow down what is going on. if we can, than we 
can hopefully find and fix the issue fast. 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-17  9:59                                                                                               ` Alexandre DERUMIER
@ 2020-09-17 10:02                                                                                                 ` Alexandre DERUMIER
  2020-09-17 11:35                                                                                                   ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-17 10:02 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

if needed, here is my test script to reproduce it:

node1 (restart corosync until node2 doesn't send the timestamp anymore)
-----

#!/bin/bash

for i in `seq 10000`; do
    now=$(date +"%T")
    echo "restart corosync : $now"
    systemctl restart corosync
    for j in {1..59}; do
        last=$(cat /tmp/timestamp)
        curr=`date '+%s'`
        diff=$(($curr - $last))
        if [ $diff -gt 20 ]; then
            echo "too old"
            exit 0
        fi
        sleep 1
    done
done



node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
-----
#!/bin/bash
for i in {1..10000};
do
   now=$(date +"%T")
   echo "Current time : $now"
   curr=`date '+%s'`
   ssh root@node1 "echo $curr > /tmp/timestamp"
   echo "test" > /etc/pve/test
   sleep 1
done


----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Sent: Thursday, 17 September 2020 11:59:32
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thanks for the update. 

>> if 
>>we can't reproduce it, we'll have to send you patches/patched debs with 
>>increased logging to narrow down what is going on. if we can, than we 
>>can hopefully find and fix the issue fast. 

No problem, I can install the patched deb if needed. 



----- Original Message ----- 
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
Sent: Thursday, 17 September 2020 11:21:45 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote: 
> I have produce it again, with the coredump this time 
> 
> 
> restart corosync : 17:05:27 
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync2.log 
> 
> 
> bt full 
> 
> https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b 
> 
> 
> coredump 
> 
> 
> http://odisoweb1.odiso.net/core.7761.gz 

just a short update on this: 

dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg 
(yet). this looks like either the START_SYNC message to node 13, or the 
STATE response from it got lost or processed wrong. until the mode 
switches to SYNCED (after all states have been received and the state 
update went through), regular/normal messages can be sent, but the 
incoming normal messages are queued and not processed. this is why the 
fuse access blocks, it sends the request out, but the response ends up 
in the queue. 

status (the other thing running on top of dfsm) got correctly synced up 
at the same time, so it's either a dcdb specific bug, or just bad luck 
that one was affected and the other wasn't. 

unfortunately even with debug enabled the logs don't contain much 
information that would help (e.g., we don't log sending/receiving STATE 
messages except when they look 'wrong'), so Thomas is trying to 
reproduce this using your scenario here to improve turn around time. if 
we can't reproduce it, we'll have to send you patches/patched debs with 
increased logging to narrow down what is going on. if we can, than we 
can hopefully find and fix the issue fast. 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-17 10:02                                                                                                 ` Alexandre DERUMIER
@ 2020-09-17 11:35                                                                                                   ` Thomas Lamprecht
  2020-09-20 23:54                                                                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-17 11:35 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote:
> if needed, here my test script to reproduce it

thanks, I'm now using this specific one; I had a similar one (but with all nodes writing)
running here for ~ two hours without luck yet, let's see how this behaves.

> 
> node1 (restart corosync until node2 don't send the timestamp anymore)
> -----
> 
> #!/bin/bash
> 
> for i in `seq 10000`; do 
>    now=$(date +"%T")
>    echo "restart corosync : $now"
>     systemctl restart corosync
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=`date '+%s'`
>         diff=$(($curr - $last))
>         if [ $diff -gt 20 ]; then
>            echo "too old"
>            exit 0
>         fi
>         sleep 1
>      done
> done 
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
> -----
> #!/bin/bash
> for i in {1..10000};
> do
>    now=$(date +"%T")
>    echo "Current time : $now"
>    curr=`date '+%s'`
>    ssh root@node1 "echo $curr > /tmp/timestamp"
>    echo "test" > /etc/pve/test
>    sleep 1
> done
> 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-17 11:35                                                                                                   ` Thomas Lamprecht
@ 2020-09-20 23:54                                                                                                     ` Alexandre DERUMIER
  2020-09-22  5:43                                                                                                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-20 23:54 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

Hi,

I have done a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s.
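
i.e. roughly the following loop on node1 (just a sketch, assuming it is run in a loop like the earlier reproducer script):

while true; do
    systemctl stop corosync
    sleep 15
    systemctl start corosync
    sleep 15
done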

I was able to reproduce it at corosync stop on node1; 1 second later, /etc/pve was locked on all other nodes.


I started corosync 10 min later on node1, and /etc/pve became writable again on all nodes.



node1: corosync stop: 01:26:50
node2 : /etc/pve locked : 01:26:51

http://odisoweb1.odiso.net/corosync-stop.log


pmxcfs : bt full all threads:

https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65

pmxcfs: coredump :http://odisoweb1.odiso.net/core.17995.gz


node1:corosync start: 01:35:36
http://odisoweb1.odiso.net/corosync-start.log





BTW, I have been contacted via PM on the forum by a user following this mailing thread,
and he had exactly the same problem with a 7-node cluster recently
(shutting down 1 node, /etc/pve was locked until that node was restarted).



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: 
> if needed, here my test script to reproduce it 

thanks, I'm now using this specific one, had a similar (but all nodes writes) 
running here since ~ two hours without luck yet, lets see how this behaves. 

> 
> node1 (restart corosync until node2 don't send the timestamp anymore) 
> ----- 
> 
> #!/bin/bash 
> 
> for i in `seq 10000`; do 
> now=$(date +"%T") 
> echo "restart corosync : $now" 
> systemctl restart corosync 
> for j in {1..59}; do 
> last=$(cat /tmp/timestamp) 
> curr=`date '+%s'` 
> diff=$(($curr - $last)) 
> if [ $diff -gt 20 ]; then 
> echo "too old" 
> exit 0 
> fi 
> sleep 1 
> done 
> done 
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1) 
> ----- 
> #!/bin/bash 
> for i in {1..10000}; 
> do 
> now=$(date +"%T") 
> echo "Current time : $now" 
> curr=`date '+%s'` 
> ssh root@node1 "echo $curr > /tmp/timestamp" 
> echo "test" > /etc/pve/test 
> sleep 1 
> done 
> 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-20 23:54                                                                                                     ` Alexandre DERUMIER
@ 2020-09-22  5:43                                                                                                       ` Alexandre DERUMIER
  2020-09-24 14:02                                                                                                         ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-22  5:43 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Thomas Lamprecht

I have done a test with "kill -9 <pid of corosync>", and I get around a 20s hang on the other nodes,
but after that it becomes available again.
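
For reference, the test boils down to something like this (sketch only, assuming corosync runs as a single process):

kill -9 $(pidof corosync)    # hard kill, no clean CPG leave
# -> roughly a 20s hang on the other nodes, then /etc/pve becomes usable again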


So, it's really something that happens when corosync is in its shutdown phase while pmxcfs is still running.

So, for now, as a workaround, I have changed

/lib/systemd/system/pve-cluster.service

#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service


This way, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first as well.
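
As a sketch of one way to apply and verify that change (as far as I know, ordering/dependency lines can't be removed via a normal drop-in, so the full unit has to be overridden):

systemctl edit --full pve-cluster.service    # replace the Wants=/Before= lines with the Requires=/After= lines above
systemctl daemon-reload
systemctl show pve-cluster.service -p Requires -p After    # check the resulting dependencies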




----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 21 September 2020 01:54:59
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi, 

I have done a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s. 

I was able to reproduce it at corosync stop on node1, 1second later /etc/pve was locked on all other nodes. 


I have started corosync 10min later on node1, and /etc/pve has become writeable again on all nodes 



node1: corosync stop: 01:26:50 
node2 : /etc/pve locked : 01:26:51 

http://odisoweb1.odiso.net/corosync-stop.log 


pmxcfs : bt full all threads: 

https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65 

pmxcfs: coredump :http://odisoweb1.odiso.net/core.17995.gz 


node1:corosync start: 01:35:36 
http://odisoweb1.odiso.net/corosync-start.log 





BTW, I have been contacted in pm on the forum by a user following this mailing thread, 
and he had exactly the same problem with a 7 nodes cluster recently. 
(shutting down 1 node, /etc/pve was locked until the node was restarted) 



----- Original Message ----- 
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com> 
Sent: Thursday, 17 September 2020 13:35:55 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: 
> if needed, here my test script to reproduce it 

thanks, I'm now using this specific one, had a similar (but all nodes writes) 
running here since ~ two hours without luck yet, lets see how this behaves. 

> 
> node1 (restart corosync until node2 don't send the timestamp anymore) 
> ----- 
> 
> #!/bin/bash 
> 
> for i in `seq 10000`; do 
> now=$(date +"%T") 
> echo "restart corosync : $now" 
> systemctl restart corosync 
> for j in {1..59}; do 
> last=$(cat /tmp/timestamp) 
> curr=`date '+%s'` 
> diff=$(($curr - $last)) 
> if [ $diff -gt 20 ]; then 
> echo "too old" 
> exit 0 
> fi 
> sleep 1 
> done 
> done 
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1) 
> ----- 
> #!/bin/bash 
> for i in {1..10000}; 
> do 
> now=$(date +"%T") 
> echo "Current time : $now" 
> curr=`date '+%s'` 
> ssh root@node1 "echo $curr > /tmp/timestamp" 
> echo "test" > /etc/pve/test 
> sleep 1 
> done 
> 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-22  5:43                                                                                                       ` Alexandre DERUMIER
@ 2020-09-24 14:02                                                                                                         ` Fabian Grünbichler
  2020-09-24 14:29                                                                                                           ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-24 14:02 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote:
> I have done test with "kill -9 <pidofcorosync",  and I have around 20s hang on other nodes,
> but after that it's become available again.
> 
> 
> So, it's really something when corosync is in shutdown phase, and pmxcfs is running.
> 
> So, for now, as workaround, I have changed
> 
> /lib/systemd/system/pve-cluster.service
> 
> #Wants=corosync.service
> #Before=corosync.service
> Requires=corosync.service
> After=corosync.service
> 
> 
> Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first.

if you are still able to test, it would be great if you could give the 
following packages a spin (they only contain some extra debug prints 
on message processing/sending):

http://download.proxmox.com/temp/pmxcfs-dbg/

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36  pve-cluster_6.1-8_amd64.deb
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef  pve-cluster-dbgsym_6.1-8_amd64.deb

ideally, you could get the debug logs from all nodes, and the 
coredump/bt from the node where pmxcfs hangs. thanks!
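
a rough sketch of the whole procedure on a hanging node (assuming the .debs sit directly under the URL above; the gdb commands are the ones mentioned earlier in the thread):

wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster_6.1-8_amd64.deb
wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster-dbgsym_6.1-8_amd64.deb
sha512sum *.deb                       # compare against the checksums above
dpkg -i pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb

# once pmxcfs hangs again, attach gdb and capture core + backtraces:
gdb -p $(pidof pmxcfs)
(gdb) generate-core-file
(gdb) thread apply all bt full

the extra debug prints, for reference: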

diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..e0bd93f 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -162,8 +162,8 @@ static void
 dfsm_send_sync_message_abort(dfsm_t *dfsm)
 {
 	g_return_if_fail(dfsm != NULL);
-
 	g_mutex_lock (&dfsm->sync_mutex);
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount);
 	dfsm->msgcount_rcvd = dfsm->msgcount;
 	g_cond_broadcast (&dfsm->sync_cond);
 	g_mutex_unlock (&dfsm->sync_mutex);
@@ -181,6 +181,7 @@ dfsm_record_local_result(
 
 	g_mutex_lock (&dfsm->sync_mutex);
 	dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count);
+	cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64":  %d", msg_count, msg_result);
 	if (rp) {
 		rp->result = msg_result;
 		rp->processed = processed;
@@ -235,6 +236,8 @@ dfsm_send_state_message_full(
 	g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
 	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
 
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);
+
 	dfsm_message_state_header_t header;
 	header.base.type = type;
 	header.base.subtype = 0;
@@ -317,6 +320,7 @@ dfsm_send_message_sync(
 	for (int i = 0; i < len; i++)
 		real_iov[i + 1] = iov[i];
 
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len);
 	cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
 
 	g_mutex_unlock (&dfsm->sync_mutex);
@@ -335,10 +339,12 @@ dfsm_send_message_sync(
 	if (rp) {
 		g_mutex_lock (&dfsm->sync_mutex);
 
-		while (dfsm->msgcount_rcvd < msgcount)
+		while (dfsm->msgcount_rcvd < msgcount) {
+			cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount);
 			g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);
+		}
+		cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!");
 
-      
 		g_hash_table_remove(dfsm->results, &rp->msgcount);
 		
 		g_mutex_unlock (&dfsm->sync_mutex);
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback(
 		return;
 	}
 
+	cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type);
+
 	if (base_header->type == DFSM_MESSAGE_NORMAL) {
 
 		dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg;
+		cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)",
+			      base_header->type, base_header->subtype, msg_len);
 
 		if (msg_len < sizeof(dfsm_message_normal_header_t)) {
 			cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)",
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback(
 		} else {
 
 			int msg_res = -1;
+			cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)",
+				      header->count, base_header->subtype, msg_len); 
 			int res = dfsm->dfsm_callbacks->dfsm_deliver_fn(
 				dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
 				base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t),
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback(
 	 */
 
 	dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg;
+	cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d",
+			 base_header->type, base_header->subtype, msg_len, mode);
 
 	if (msg_len < sizeof(dfsm_message_state_header_t)) {
 		cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)",
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback(
 
 	if (mode == DFSM_MODE_SYNCED) {
 		if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+			cfs_dom_debug(dfsm->log_domain, "received update complete message");
 
 			for (int i = 0; i < dfsm->sync_info->node_count; i++)
 				dfsm->sync_info->nodes[i].synced = 1;
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback(
 			return;
 
 		} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) {
+			cfs_dom_debug(dfsm->log_domain, "received verify request message");
 
 			if (msg_len != sizeof(dfsm->csum_counter)) {
 				cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid);
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback(
 	} else if (mode == DFSM_MODE_START_SYNC) {
 
 		if (base_header->type == DFSM_MESSAGE_SYNC_START) {
-
 			if (nodeid != dfsm->lowest_nodeid) {
 				cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d",
 						 nodeid, pid);
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback(
 			return;
 
 		} else if (base_header->type == DFSM_MESSAGE_STATE) {
+			cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid);
 
 			dfsm_node_info_t *ni;
 			
@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback(
 						goto leave;
 				}
 
+			} else {
+				cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more");
 			}
 
 			return;
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback(
 	} else if (mode == DFSM_MODE_UPDATE) {
 
 		if (base_header->type == DFSM_MESSAGE_UPDATE) {
+			cfs_dom_debug(dfsm->log_domain, "received update message");
 				
 			int res = dfsm->dfsm_callbacks->dfsm_process_update_fn(
 				dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len);
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback(
 
 		} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
 
+			cfs_dom_debug(dfsm->log_domain, "received update complete message");
 
 			int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info);
 
@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback(
 	size_t joined_list_entries)
 {
 	cs_error_t result;
+	cfs_debug("dfsm_cpg_confchg_callback called");
 
 	dfsm_t *dfsm = NULL;
 	result = cpg_context_get(handle, (gpointer *)&dfsm);




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-24 14:02                                                                                                         ` Fabian Grünbichler
@ 2020-09-24 14:29                                                                                                           ` Alexandre DERUMIER
  2020-09-24 18:07                                                                                                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-24 14:29 UTC (permalink / raw)
  To: Proxmox VE development discussion

Hi Fabian,

>>if you are still able to test, it would be great if you could give the 
>>following packages a spin (they only contain some extra debug prints 
>>on message processing/sending): 

Sure, no problem, I'm going to test it tonight.


>>ideally, you could get the debug logs from all nodes, and the 
>>coredump/bt from the node where pmxcfs hangs. thanks! 

ok, no problem.

I'll keep you in touch tomorrow.




----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:02:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: 
> I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, 
> but after that it's become available again. 
> 
> 
> So, it's really something when corosync is in shutdown phase, and pmxcfs is running. 
> 
> So, for now, as workaround, I have changed 
> 
> /lib/systemd/system/pve-cluster.service 
> 
> #Wants=corosync.service 
> #Before=corosync.service 
> Requires=corosync.service 
> After=corosync.service 
> 
> 
> Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 

if you are still able to test, it would be great if you could give the 
following packages a spin (they only contain some extra debug prints 
on message processing/sending): 

http://download.proxmox.com/temp/pmxcfs-dbg/ 

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb 

ideally, you could get the debug logs from all nodes, and the 
coredump/bt from the node where pmxcfs hangs. thanks! 

diff --git a/data/src/dfsm.c b/data/src/dfsm.c 
index 529c7f9..e0bd93f 100644 
--- a/data/src/dfsm.c 
+++ b/data/src/dfsm.c 
@@ -162,8 +162,8 @@ static void 
dfsm_send_sync_message_abort(dfsm_t *dfsm) 
{ 
g_return_if_fail(dfsm != NULL); 
- 
g_mutex_lock (&dfsm->sync_mutex); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); 
dfsm->msgcount_rcvd = dfsm->msgcount; 
g_cond_broadcast (&dfsm->sync_cond); 
g_mutex_unlock (&dfsm->sync_mutex); 
@@ -181,6 +181,7 @@ dfsm_record_local_result( 

g_mutex_lock (&dfsm->sync_mutex); 
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); 
if (rp) { 
rp->result = msg_result; 
rp->processed = processed; 
@@ -235,6 +236,8 @@ dfsm_send_state_message_full( 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 
+ 
dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
@@ -317,6 +320,7 @@ dfsm_send_message_sync( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len); 
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -335,10 +339,12 @@ dfsm_send_message_sync( 
if (rp) { 
g_mutex_lock (&dfsm->sync_mutex); 

- while (dfsm->msgcount_rcvd < msgcount) 
+ while (dfsm->msgcount_rcvd < msgcount) { 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); 
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); 
+ } 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); 

- 
g_hash_table_remove(dfsm->results, &rp->msgcount); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback( 
return; 
} 

+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); 
+ 
if (base_header->type == DFSM_MESSAGE_NORMAL) { 

dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", 
+ base_header->type, base_header->subtype, msg_len); 

if (msg_len < sizeof(dfsm_message_normal_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", 
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback( 
} else { 

int msg_res = -1; 
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", 
+ header->count, base_header->subtype, msg_len); 
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( 
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), 
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback( 
*/ 

dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", 
+ base_header->type, base_header->subtype, msg_len, mode); 

if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", 
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback( 

if (mode == DFSM_MODE_SYNCED) { 
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

for (int i = 0; i < dfsm->sync_info->node_count; i++) 
dfsm->sync_info->nodes[i].synced = 1; 
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { 
+ cfs_dom_debug(dfsm->log_domain, "received verify request message"); 

if (msg_len != sizeof(dfsm->csum_counter)) { 
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); 
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_START_SYNC) { 

if (base_header->type == DFSM_MESSAGE_SYNC_START) { 
- 
if (nodeid != dfsm->lowest_nodeid) { 
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", 
nodeid, pid); 
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_STATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); 

dfsm_node_info_t *ni; 

@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback( 
goto leave; 
} 

+ } else { 
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); 
} 

return; 
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_UPDATE) { 

if (base_header->type == DFSM_MESSAGE_UPDATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update message"); 

int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( 
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); 
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback( 

} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 

+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); 

@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback( 
size_t joined_list_entries) 
{ 
cs_error_t result; 
+ cfs_debug("dfsm_cpg_confchg_callback called"); 

dfsm_t *dfsm = NULL; 
result = cpg_context_get(handle, (gpointer *)&dfsm); 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-24 14:29                                                                                                           ` Alexandre DERUMIER
@ 2020-09-24 18:07                                                                                                             ` Alexandre DERUMIER
  2020-09-25  6:44                                                                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-24 18:07 UTC (permalink / raw)
  To: Proxmox VE development discussion

I was able to reproduce it:

stop corosync on node1 : 18:12:29
/etc/pve locked at 18:12:30

logs of all nodes are here:
http://odisoweb1.odiso.net/test1/


I don't have a coredump, as my coworker restarted pmxcfs too fast :/. Sorry.


I'm going to launch another test, with a coredump this time.



----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:29:17
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi fabian, 

>>if you are still able to test, it would be great if you could give the 
>>following packages a spin (they only contain some extra debug prints 
>>on message processing/sending): 

Sure, no problem, I'm going to test it tonight. 


>>ideally, you could get the debug logs from all nodes, and the 
>>coredump/bt from the node where pmxcfs hangs. thanks! 

ok,no problem. 

I'll keep you in touch tomorrow. 




----- Original Message ----- 
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Thursday, 24 September 2020 16:02:04 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: 
> I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, 
> but after that it's become available again. 
> 
> 
> So, it's really something when corosync is in shutdown phase, and pmxcfs is running. 
> 
> So, for now, as workaround, I have changed 
> 
> /lib/systemd/system/pve-cluster.service 
> 
> #Wants=corosync.service 
> #Before=corosync.service 
> Requires=corosync.service 
> After=corosync.service 
> 
> 
> Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 

if you are still able to test, it would be great if you could give the 
following packages a spin (they only contain some extra debug prints 
on message processing/sending): 

http://download.proxmox.com/temp/pmxcfs-dbg/ 

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb 

ideally, you could get the debug logs from all nodes, and the 
coredump/bt from the node where pmxcfs hangs. thanks! 

diff --git a/data/src/dfsm.c b/data/src/dfsm.c 
index 529c7f9..e0bd93f 100644 
--- a/data/src/dfsm.c 
+++ b/data/src/dfsm.c 
@@ -162,8 +162,8 @@ static void 
dfsm_send_sync_message_abort(dfsm_t *dfsm) 
{ 
g_return_if_fail(dfsm != NULL); 
- 
g_mutex_lock (&dfsm->sync_mutex); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); 
dfsm->msgcount_rcvd = dfsm->msgcount; 
g_cond_broadcast (&dfsm->sync_cond); 
g_mutex_unlock (&dfsm->sync_mutex); 
@@ -181,6 +181,7 @@ dfsm_record_local_result( 

g_mutex_lock (&dfsm->sync_mutex); 
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); 
if (rp) { 
rp->result = msg_result; 
rp->processed = processed; 
@@ -235,6 +236,8 @@ dfsm_send_state_message_full( 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 
+ 
dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
@@ -317,6 +320,7 @@ dfsm_send_message_sync( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len); 
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -335,10 +339,12 @@ dfsm_send_message_sync( 
if (rp) { 
g_mutex_lock (&dfsm->sync_mutex); 

- while (dfsm->msgcount_rcvd < msgcount) 
+ while (dfsm->msgcount_rcvd < msgcount) { 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); 
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); 
+ } 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); 

- 
g_hash_table_remove(dfsm->results, &rp->msgcount); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback( 
return; 
} 

+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); 
+ 
if (base_header->type == DFSM_MESSAGE_NORMAL) { 

dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", 
+ base_header->type, base_header->subtype, msg_len); 

if (msg_len < sizeof(dfsm_message_normal_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", 
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback( 
} else { 

int msg_res = -1; 
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", 
+ header->count, base_header->subtype, msg_len); 
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( 
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), 
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback( 
*/ 

dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", 
+ base_header->type, base_header->subtype, msg_len, mode); 

if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", 
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback( 

if (mode == DFSM_MODE_SYNCED) { 
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

for (int i = 0; i < dfsm->sync_info->node_count; i++) 
dfsm->sync_info->nodes[i].synced = 1; 
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { 
+ cfs_dom_debug(dfsm->log_domain, "received verify request message"); 

if (msg_len != sizeof(dfsm->csum_counter)) { 
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); 
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_START_SYNC) { 

if (base_header->type == DFSM_MESSAGE_SYNC_START) { 
- 
if (nodeid != dfsm->lowest_nodeid) { 
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", 
nodeid, pid); 
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_STATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); 

dfsm_node_info_t *ni; 

@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback( 
goto leave; 
} 

+ } else { 
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); 
} 

return; 
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_UPDATE) { 

if (base_header->type == DFSM_MESSAGE_UPDATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update message"); 

int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( 
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); 
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback( 

} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 

+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); 

@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback( 
size_t joined_list_entries) 
{ 
cs_error_t result; 
+ cfs_debug("dfsm_cpg_confchg_callback called"); 

dfsm_t *dfsm = NULL; 
result = cpg_context_get(handle, (gpointer *)&dfsm); 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-24 18:07                                                                                                             ` Alexandre DERUMIER
@ 2020-09-25  6:44                                                                                                               ` Alexandre DERUMIER
  2020-09-25  7:15                                                                                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-25  6:44 UTC (permalink / raw)
  To: Proxmox VE development discussion

Another test this morning with the coredump available

http://odisoweb1.odiso.net/test2/


Something different this time: it happened on corosync start


node1 (corosync start)
------
start corosync : 08:06:56


node2 (/etc/pve locked)
-----
Current time : 08:07:01



I had a warning when generating the coredump:

(gdb) generate-core-file
warning: target file /proc/35248/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.35248

I hope it's ok.


I'll do another test this morning

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 20:07:43
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

I was able to reproduce 

stop corosync on node1 : 18:12:29 
/etc/pve locked at 18:12:30 

logs of all nodes are here: 
http://odisoweb1.odiso.net/test1/ 


I don't have coredump, as my coworker have restarted pmxcfs too fast :/ . Sorry. 


I'm going to launch another test with coredump this time 



----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:29:17
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi fabian, 

>>if you are still able to test, it would be great if you could give the 
>>following packages a spin (they only contain some extra debug prints 
>>on message processing/sending): 

Sure, no problem, I'm going to test it tonight. 


>>ideally, you could get the debug logs from all nodes, and the 
>>coredump/bt from the node where pmxcfs hangs. thanks! 

ok,no problem. 

I'll keep you in touch tomorrow. 




----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:02:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: 
> I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, 
> but after that it's become available again. 
> 
> 
> So, it's really something when corosync is in shutdown phase, and pmxcfs is running. 
> 
> So, for now, as workaround, I have changed 
> 
> /lib/systemd/system/pve-cluster.service 
> 
> #Wants=corosync.service 
> #Before=corosync.service 
> Requires=corosync.service 
> After=corosync.service 
> 
> 
> Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 

if you are still able to test, it would be great if you could give the 
following packages a spin (they only contain some extra debug prints 
on message processing/sending): 

http://download.proxmox.com/temp/pmxcfs-dbg/ 

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb 

ideally, you could get the debug logs from all nodes, and the 
coredump/bt from the node where pmxcfs hangs. thanks! 

diff --git a/data/src/dfsm.c b/data/src/dfsm.c 
index 529c7f9..e0bd93f 100644 
--- a/data/src/dfsm.c 
+++ b/data/src/dfsm.c 
@@ -162,8 +162,8 @@ static void 
dfsm_send_sync_message_abort(dfsm_t *dfsm) 
{ 
g_return_if_fail(dfsm != NULL); 
- 
g_mutex_lock (&dfsm->sync_mutex); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); 
dfsm->msgcount_rcvd = dfsm->msgcount; 
g_cond_broadcast (&dfsm->sync_cond); 
g_mutex_unlock (&dfsm->sync_mutex); 
@@ -181,6 +181,7 @@ dfsm_record_local_result( 

g_mutex_lock (&dfsm->sync_mutex); 
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); 
if (rp) { 
rp->result = msg_result; 
rp->processed = processed; 
@@ -235,6 +236,8 @@ dfsm_send_state_message_full( 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 
+ 
dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
@@ -317,6 +320,7 @@ dfsm_send_message_sync( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len); 
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -335,10 +339,12 @@ dfsm_send_message_sync( 
if (rp) { 
g_mutex_lock (&dfsm->sync_mutex); 

- while (dfsm->msgcount_rcvd < msgcount) 
+ while (dfsm->msgcount_rcvd < msgcount) { 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); 
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); 
+ } 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); 

- 
g_hash_table_remove(dfsm->results, &rp->msgcount); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback( 
return; 
} 

+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); 
+ 
if (base_header->type == DFSM_MESSAGE_NORMAL) { 

dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", 
+ base_header->type, base_header->subtype, msg_len); 

if (msg_len < sizeof(dfsm_message_normal_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", 
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback( 
} else { 

int msg_res = -1; 
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", 
+ header->count, base_header->subtype, msg_len); 
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( 
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), 
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback( 
*/ 

dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", 
+ base_header->type, base_header->subtype, msg_len, mode); 

if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", 
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback( 

if (mode == DFSM_MODE_SYNCED) { 
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

for (int i = 0; i < dfsm->sync_info->node_count; i++) 
dfsm->sync_info->nodes[i].synced = 1; 
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { 
+ cfs_dom_debug(dfsm->log_domain, "received verify request message"); 

if (msg_len != sizeof(dfsm->csum_counter)) { 
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); 
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_START_SYNC) { 

if (base_header->type == DFSM_MESSAGE_SYNC_START) { 
- 
if (nodeid != dfsm->lowest_nodeid) { 
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", 
nodeid, pid); 
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_STATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); 

dfsm_node_info_t *ni; 

@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback( 
goto leave; 
} 

+ } else { 
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); 
} 

return; 
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_UPDATE) { 

if (base_header->type == DFSM_MESSAGE_UPDATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update message"); 

int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( 
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); 
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback( 

} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 

+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); 

@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback( 
size_t joined_list_entries) 
{ 
cs_error_t result; 
+ cfs_debug("dfsm_cpg_confchg_callback called"); 

dfsm_t *dfsm = NULL; 
result = cpg_context_get(handle, (gpointer *)&dfsm); 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25  6:44                                                                                                               ` Alexandre DERUMIER
@ 2020-09-25  7:15                                                                                                                 ` Alexandre DERUMIER
  2020-09-25  9:19                                                                                                                   ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-25  7:15 UTC (permalink / raw)
  To: Proxmox VE development discussion


Another hang, this time on corosync stop, coredump available

http://odisoweb1.odiso.net/test3/ 

                                                                                                             
node1
----
stop corosync : 09:03:10

node2: /etc/pve locked
------
Current time : 09:03:10




----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Friday, 25 September 2020 08:44:24
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Another test this morning with the coredump available 

http://odisoweb1.odiso.net/test2/ 


Something different this time, it has happened on corosync start 


node1 (corosync start) 
------ 
start corosync : 08:06:56 


node2 (/etc/pve locked) 
----- 
Current time : 08:07:01 



I had a warning on coredump 

(gdb) generate-core-file 
warning: target file /proc/35248/cmdline contained unexpected null characters 
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000. 
Saved corefile core.35248 

I hope it's ok. 


I'll do another test this morning 

----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 20:07:43
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

I was able to reproduce 

stop corosync on node1 : 18:12:29 
/etc/pve locked at 18:12:30 

logs of all nodes are here: 
http://odisoweb1.odiso.net/test1/ 


I don't have coredump, as my coworker have restarted pmxcfs too fast :/ . Sorry. 


I'm going to launch another test with coredump this time 



----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:29:17
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi fabian, 

>>if you are still able to test, it would be great if you could give the 
>>following packages a spin (they only contain some extra debug prints 
>>on message processing/sending): 

Sure, no problem, I'm going to test it tonight. 


>>ideally, you could get the debug logs from all nodes, and the 
>>coredump/bt from the node where pmxcfs hangs. thanks! 

ok,no problem. 

I'll keep you in touch tomorrow. 




----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, 24 September 2020 16:02:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: 
> I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, 
> but after that it's become available again. 
> 
> 
> So, it's really something when corosync is in shutdown phase, and pmxcfs is running. 
> 
> So, for now, as workaround, I have changed 
> 
> /lib/systemd/system/pve-cluster.service 
> 
> #Wants=corosync.service 
> #Before=corosync.service 
> Requires=corosync.service 
> After=corosync.service 
> 
> 
> Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 

if you are still able to test, it would be great if you could give the 
following packages a spin (they only contain some extra debug prints 
on message processing/sending): 

http://download.proxmox.com/temp/pmxcfs-dbg/ 

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb 

ideally, you could get the debug logs from all nodes, and the 
coredump/bt from the node where pmxcfs hangs. thanks! 

diff --git a/data/src/dfsm.c b/data/src/dfsm.c 
index 529c7f9..e0bd93f 100644 
--- a/data/src/dfsm.c 
+++ b/data/src/dfsm.c 
@@ -162,8 +162,8 @@ static void 
dfsm_send_sync_message_abort(dfsm_t *dfsm) 
{ 
g_return_if_fail(dfsm != NULL); 
- 
g_mutex_lock (&dfsm->sync_mutex); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); 
dfsm->msgcount_rcvd = dfsm->msgcount; 
g_cond_broadcast (&dfsm->sync_cond); 
g_mutex_unlock (&dfsm->sync_mutex); 
@@ -181,6 +181,7 @@ dfsm_record_local_result( 

g_mutex_lock (&dfsm->sync_mutex); 
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); 
if (rp) { 
rp->result = msg_result; 
rp->processed = processed; 
@@ -235,6 +236,8 @@ dfsm_send_state_message_full( 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 
+ 
dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
@@ -317,6 +320,7 @@ dfsm_send_message_sync( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len); 
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -335,10 +339,12 @@ dfsm_send_message_sync( 
if (rp) { 
g_mutex_lock (&dfsm->sync_mutex); 

- while (dfsm->msgcount_rcvd < msgcount) 
+ while (dfsm->msgcount_rcvd < msgcount) { 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); 
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); 
+ } 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); 

- 
g_hash_table_remove(dfsm->results, &rp->msgcount); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback( 
return; 
} 

+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); 
+ 
if (base_header->type == DFSM_MESSAGE_NORMAL) { 

dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", 
+ base_header->type, base_header->subtype, msg_len); 

if (msg_len < sizeof(dfsm_message_normal_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", 
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback( 
} else { 

int msg_res = -1; 
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", 
+ header->count, base_header->subtype, msg_len); 
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( 
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), 
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback( 
*/ 

dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", 
+ base_header->type, base_header->subtype, msg_len, mode); 

if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", 
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback( 

if (mode == DFSM_MODE_SYNCED) { 
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

for (int i = 0; i < dfsm->sync_info->node_count; i++) 
dfsm->sync_info->nodes[i].synced = 1; 
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { 
+ cfs_dom_debug(dfsm->log_domain, "received verify request message"); 

if (msg_len != sizeof(dfsm->csum_counter)) { 
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); 
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_START_SYNC) { 

if (base_header->type == DFSM_MESSAGE_SYNC_START) { 
- 
if (nodeid != dfsm->lowest_nodeid) { 
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", 
nodeid, pid); 
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_STATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); 

dfsm_node_info_t *ni; 

@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback( 
goto leave; 
} 

+ } else { 
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); 
} 

return; 
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_UPDATE) { 

if (base_header->type == DFSM_MESSAGE_UPDATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update message"); 

int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( 
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); 
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback( 

} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 

+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); 

@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback( 
size_t joined_list_entries) 
{ 
cs_error_t result; 
+ cfs_debug("dfsm_cpg_confchg_callback called"); 

dfsm_t *dfsm = NULL; 
result = cpg_context_get(handle, (gpointer *)&dfsm); 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25  7:15                                                                                                                 ` Alexandre DERUMIER
@ 2020-09-25  9:19                                                                                                                   ` Fabian Grünbichler
  2020-09-25  9:46                                                                                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-25  9:19 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
> 
> Another hang, this time on corosync stop, coredump available
> 
> http://odisoweb1.odiso.net/test3/ 
> 
>                                                                                                              
> node1
> ----
> stop corosync : 09:03:10
> 
> node2: /etc/pve locked
> ------
> Current time : 09:03:10

thanks, these all indicate the same symptoms:

1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts sync process
3. all (online) nodes receive sync request for dcdb and status
4. all nodes send state for dcdb and status via CPG
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3)

in step 5, there is no trace of the message on the receiving side, even 
though the sending node does not log an error. as before, the hang is 
just a side-effect of the state machine ending up in a state that should 
be short-lived (syncing, waiting for state from all nodes) with no 
progress. the code and theory say that this should not happen: either
sending the state fails, triggering the node to leave the CPG (restarting
the sync), or a node drops out of quorum (triggering a config change,
which restarts the sync), or we get all states from all nodes and the
sync proceeds. this looks to me like a fundamental assumption/guarantee
does not hold.
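
to make the stuck state easier to picture, here is a minimal standalone
sketch of that invariant (simplified names and types, not the actual
dfsm.c code): the sync only advances once a state message has been
recorded for every member, so one silently lost DFSM_MESSAGE_STATE leaves
the mode parked in "start sync" with nothing left to re-trigger it:

#include <stdbool.h>
#include <stdio.h>

#define NODE_COUNT 3

typedef struct {
    unsigned int nodeid;
    bool state_received;
} node_info_t;

static bool all_states_received(const node_info_t *nodes, int count)
{
    for (int i = 0; i < count; i++)
        if (!nodes[i].state_received)
            return false;
    return true;
}

/* what the deliver callback conceptually does for DFSM_MESSAGE_STATE */
static void on_state_message(node_info_t *nodes, int count, unsigned int nodeid)
{
    for (int i = 0; i < count; i++)
        if (nodes[i].nodeid == nodeid)
            nodes[i].state_received = true;

    if (all_states_received(nodes, count))
        printf("all states received, sync can proceed\n");
    else
        printf("haven't received all states, waiting for more\n");
}

int main(void)
{
    node_info_t nodes[NODE_COUNT] = { {1, false}, {2, false}, {13, false} };

    /* node 13's state never arrives (no send error, no config change),
     * so nothing re-triggers the sync and /etc/pve stays locked */
    on_state_message(nodes, NODE_COUNT, 1);
    on_state_message(nodes, NODE_COUNT, 2);
    return 0;
}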

I will rebuild once more, modifying the send code a bit to log a lot more
details when sending state messages; it would be great if you could
repeat with that as we are still unable to reproduce the issue. 
hopefully those logs will then indicate whether this is a corosync/knet 
bug, or if the issue is in our state machine code somewhere. so far it 
looks more like the former..




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25  9:19                                                                                                                   ` Fabian Grünbichler
@ 2020-09-25  9:46                                                                                                                     ` Alexandre DERUMIER
  2020-09-25 12:51                                                                                                                       ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-25  9:46 UTC (permalink / raw)
  To: Proxmox VE development discussion

>>I will rebuild once more modifying the send code a bit to log a lot more
>>details when sending state messages, it would be great if you could
>>repeat with that as we are still unable to reproduce the issue.

ok, no problem, I'm able to easily reproduce it; I'll do a new test when you
send the new version.

(and thanks again for debugging this, because it's really beyond my competence)



----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Friday, 25 September 2020 11:19:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 9:15 am, Alexandre DERUMIER wrote: 
> 
> Another hang, this time on corosync stop, coredump available 
> 
> http://odisoweb1.odiso.net/test3/ 
> 
> 
> node1 
> ---- 
> stop corosync : 09:03:10 
> 
> node2: /etc/pve locked 
> ------ 
> Current time : 09:03:10 

thanks, these all indicate the same symptoms: 

1. cluster config changes (corosync goes down/comes back up in this case) 
2. pmxcfs starts sync process 
3. all (online) nodes receive sync request for dcdb and status 
4. all nodes send state for dcdb and status via CPG 
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3) 

in step 5, there is no trace of the message on the receiving side, even 
though the sending node does not log an error. as before, the hang is 
just a side-effect of the state machine ending up in a state that should 
be short-lived (syncing, waiting for state from all nodes) with no 
progress. the code and theory say that this should not happen, as either 
sending the state fails triggering the node to leave the CPG (restarting 
the sync), or a node drops out of quorum (triggering a config change, 
which triggers restarting the sync), or we get all states from all nodes 
and the sync proceeds. this looks to me like a fundamental 
assumption/guarantee does not hold.. 

I will rebuild once more modifying the send code a bit to log a lot more 
details when sending state messages, it would be great if you could 
repeat with that as we are still unable to reproduce the issue. 
hopefully those logs will then indicate whether this is a corosync/knet 
bug, or if the issue is in our state machine code somewhere. so far it 
looks more like the former.. 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25  9:46                                                                                                                     ` Alexandre DERUMIER
@ 2020-09-25 12:51                                                                                                                       ` Fabian Grünbichler
  2020-09-25 16:29                                                                                                                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-25 12:51 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 25, 2020 11:46 am, Alexandre DERUMIER wrote:
>>>I will rebuild once more modifying the send code a bit to log a lot more
>>>details when sending state messages, it would be great if you could
>>>repeat with that as we are still unable to reproduce the issue.
> 
> ok, no problem, I'm able to easily reproduce it, I'll do new test when you'll
> send the new version.
> 
> (and thanks again to debugging this, because It's really beyond my competence)

same procedure as last time, same place, new checksums:

6b5e2defe543a874e0f7a883e40b279c997438dde158566c4c93c11ea531aef924ed2e4eb2506b5064e49ec9bdd4ebe7acd0b9278e9286eac527b0b15a43d8d7  pve-cluster_6.1-8_amd64.deb
d57ddc08824055826ee15c9c255690d9140e43f8d5164949108f0dc483a2d181b2bda76f0e7f47202699a062342c0cf0bba8f2ae0f7c5411af9967ef051050a0  pve-cluster-dbgsym_6.1-8_amd64.deb

I found one (unfortunately unrelated) bug in our error handling, so 
there's that at least ;)

diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..f3397a0 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -162,8 +162,8 @@ static void
 dfsm_send_sync_message_abort(dfsm_t *dfsm)
 {
 	g_return_if_fail(dfsm != NULL);
-
 	g_mutex_lock (&dfsm->sync_mutex);
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount);
 	dfsm->msgcount_rcvd = dfsm->msgcount;
 	g_cond_broadcast (&dfsm->sync_cond);
 	g_mutex_unlock (&dfsm->sync_mutex);
@@ -181,6 +181,7 @@ dfsm_record_local_result(
 
 	g_mutex_lock (&dfsm->sync_mutex);
 	dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count);
+	cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64":  %d", msg_count, msg_result);
 	if (rp) {
 		rp->result = msg_result;
 		rp->processed = processed;
@@ -224,6 +225,48 @@ loop:
 	return result;
 }
 
+static cs_error_t 
+dfsm_send_message_full_debug_state(
+	dfsm_t *dfsm,
+	struct iovec *iov, 
+	unsigned int len,
+	int retry)
+{
+	g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
+	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
+
+	struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
+	cs_error_t result;
+	int retries = 0;
+	cfs_dom_message(dfsm->log_domain, "send state message debug");
+	cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);
+	for (int i = 0; i < len; i++)
+		cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);
+loop:
+	cfs_dom_message(dfsm->log_domain, "send state message loop body");
+
+	result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
+
+	cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);
+	if (retry && result == CS_ERR_TRY_AGAIN) {
+		nanosleep(&tvreq, NULL);
+		++retries;
+		if ((retries % 10) == 0)
+			cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
+		if (retries < 100)
+			goto loop;
+	}
+
+	if (retries)
+		cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);
+
+	if (result != CS_OK &&
+	    (!retry || result != CS_ERR_TRY_AGAIN))
+		cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);
+
+	return result;
+}
+
 static cs_error_t 
 dfsm_send_state_message_full(
 	dfsm_t *dfsm,
@@ -235,6 +278,8 @@ dfsm_send_state_message_full(
 	g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
 	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
 
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);
+
 	dfsm_message_state_header_t header;
 	header.base.type = type;
 	header.base.subtype = 0;
@@ -252,7 +297,7 @@ dfsm_send_state_message_full(
 	for (int i = 0; i < len; i++)
 		real_iov[i + 1] = iov[i];
 
-	return dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
+	return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
 }
 
 cs_error_t 
@@ -317,6 +362,7 @@ dfsm_send_message_sync(
 	for (int i = 0; i < len; i++)
 		real_iov[i + 1] = iov[i];
 
+	cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, count %" PRIu64 ", msgtype %d, len %d", msgcount, msgtype, len);
 	cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
 
 	g_mutex_unlock (&dfsm->sync_mutex);
@@ -335,10 +381,12 @@ dfsm_send_message_sync(
 	if (rp) {
 		g_mutex_lock (&dfsm->sync_mutex);
 
-		while (dfsm->msgcount_rcvd < msgcount)
+		while (dfsm->msgcount_rcvd < msgcount) {
+			cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount);
 			g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);
+		}
+		cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!");
 
-      
 		g_hash_table_remove(dfsm->results, &rp->msgcount);
 		
 		g_mutex_unlock (&dfsm->sync_mutex);
@@ -685,9 +733,13 @@ dfsm_cpg_deliver_callback(
 		return;
 	}
 
+	cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type);
+
 	if (base_header->type == DFSM_MESSAGE_NORMAL) {
 
 		dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg;
+		cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)",
+			      base_header->type, base_header->subtype, msg_len);
 
 		if (msg_len < sizeof(dfsm_message_normal_header_t)) {
 			cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)",
@@ -704,6 +756,8 @@ dfsm_cpg_deliver_callback(
 		} else {
 
 			int msg_res = -1;
+			cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)",
+				      header->count, base_header->subtype, msg_len); 
 			int res = dfsm->dfsm_callbacks->dfsm_deliver_fn(
 				dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
 				base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t),
@@ -724,6 +778,8 @@ dfsm_cpg_deliver_callback(
 	 */
 
 	dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg;
+	cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d",
+			 base_header->type, base_header->subtype, msg_len, mode);
 
 	if (msg_len < sizeof(dfsm_message_state_header_t)) {
 		cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)",
@@ -744,6 +800,7 @@ dfsm_cpg_deliver_callback(
 
 	if (mode == DFSM_MODE_SYNCED) {
 		if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+			cfs_dom_debug(dfsm->log_domain, "received update complete message");
 
 			for (int i = 0; i < dfsm->sync_info->node_count; i++)
 				dfsm->sync_info->nodes[i].synced = 1;
@@ -754,6 +811,7 @@ dfsm_cpg_deliver_callback(
 			return;
 
 		} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) {
+			cfs_dom_debug(dfsm->log_domain, "received verify request message");
 
 			if (msg_len != sizeof(dfsm->csum_counter)) {
 				cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid);
@@ -823,7 +881,6 @@ dfsm_cpg_deliver_callback(
 	} else if (mode == DFSM_MODE_START_SYNC) {
 
 		if (base_header->type == DFSM_MESSAGE_SYNC_START) {
-
 			if (nodeid != dfsm->lowest_nodeid) {
 				cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d",
 						 nodeid, pid);
@@ -861,6 +918,7 @@ dfsm_cpg_deliver_callback(
 			return;
 
 		} else if (base_header->type == DFSM_MESSAGE_STATE) {
+			cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid);
 
 			dfsm_node_info_t *ni;
 			
@@ -906,6 +964,8 @@ dfsm_cpg_deliver_callback(
 						goto leave;
 				}
 
+			} else {
+				cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more");
 			}
 
 			return;
@@ -914,6 +974,7 @@ dfsm_cpg_deliver_callback(
 	} else if (mode == DFSM_MODE_UPDATE) {
 
 		if (base_header->type == DFSM_MESSAGE_UPDATE) {
+			cfs_dom_debug(dfsm->log_domain, "received update message");
 				
 			int res = dfsm->dfsm_callbacks->dfsm_process_update_fn(
 				dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len);
@@ -925,6 +986,7 @@ dfsm_cpg_deliver_callback(
 
 		} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
 
+			cfs_dom_debug(dfsm->log_domain, "received update complete message");
 
 			int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info);
 
@@ -1104,6 +1166,7 @@ dfsm_cpg_confchg_callback(
 	size_t joined_list_entries)
 {
 	cs_error_t result;
+	cfs_debug("dfsm_cpg_confchg_callback called");
 
 	dfsm_t *dfsm = NULL;
 	result = cpg_context_get(handle, (gpointer *)&dfsm);
@@ -1190,7 +1253,7 @@ dfsm_cpg_confchg_callback(
 
 		dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC);
 		if (lowest_nodeid == dfsm->nodeid) {
-			if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) {
+			if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) {
 				cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message");
 				goto leave;
 			}
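
the last hunk is the unrelated error-handling fix mentioned above:
dfsm_send_state_message_full returns a cs_error_t, and in corosync's
cs_error_t numbering CS_OK is 1 and every error code is also non-zero, so
the old boolean negation could never take the failure branch. a small
standalone illustration (enum values hard-coded here just for the
example):

#include <stdio.h>

typedef enum { CS_OK = 1, CS_ERR_TRY_AGAIN = 6 } cs_error_t;

int main(void)
{
    cs_error_t results[] = { CS_OK, CS_ERR_TRY_AGAIN };

    for (int i = 0; i < 2; i++) {
        cs_error_t result = results[i];

        /* old check: only true for a zero return, which cs_error_t never
         * produces, so a failed SYNC_START send was silently ignored */
        if (!result)
            printf("old check caught result %d\n", result);

        /* fixed check: true for every non-CS_OK return */
        if (result != CS_OK)
            printf("fixed check caught result %d\n", result);
    }
    return 0;
}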




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25 12:51                                                                                                                       ` Fabian Grünbichler
@ 2020-09-25 16:29                                                                                                                         ` Alexandre DERUMIER
  2020-09-28  9:17                                                                                                                           ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-25 16:29 UTC (permalink / raw)
  To: Proxmox VE development discussion

Here is a new hang:

http://odisoweb1.odiso.net/test4/


This time on corosync start.



node1:
-----
start corosync : 17:22:02


node2
-----
/etc/pve locked 17:22:07



Something new: generating a coredump or bt-full on pmxcfs on node1
unlocked /etc/pve on the other nodes

/etc/pve unlocked (with coredump or bt-full): 17:57:40



----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Friday, 25 September 2020 14:51:30
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 11:46 am, Alexandre DERUMIER wrote: 
>>>I will rebuild once more modifying the send code a bit to log a lot more 
>>>details when sending state messages, it would be great if you could 
>>>repeat with that as we are still unable to reproduce the issue. 
> 
> ok, no problem, I'm able to easily reproduce it, I'll do new test when you'll 
> send the new version. 
> 
> (and thanks again to debugging this, because It's really beyond my competence) 

same procedure as last time, same place, new checksums: 

6b5e2defe543a874e0f7a883e40b279c997438dde158566c4c93c11ea531aef924ed2e4eb2506b5064e49ec9bdd4ebe7acd0b9278e9286eac527b0b15a43d8d7 pve-cluster_6.1-8_amd64.deb 
d57ddc08824055826ee15c9c255690d9140e43f8d5164949108f0dc483a2d181b2bda76f0e7f47202699a062342c0cf0bba8f2ae0f7c5411af9967ef051050a0 pve-cluster-dbgsym_6.1-8_amd64.deb 

I found one (unfortunately unrelated) bug in our error handling, so 
there's that at least ;) 

diff --git a/data/src/dfsm.c b/data/src/dfsm.c 
index 529c7f9..f3397a0 100644 
--- a/data/src/dfsm.c 
+++ b/data/src/dfsm.c 
@@ -162,8 +162,8 @@ static void 
dfsm_send_sync_message_abort(dfsm_t *dfsm) 
{ 
g_return_if_fail(dfsm != NULL); 
- 
g_mutex_lock (&dfsm->sync_mutex); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); 
dfsm->msgcount_rcvd = dfsm->msgcount; 
g_cond_broadcast (&dfsm->sync_cond); 
g_mutex_unlock (&dfsm->sync_mutex); 
@@ -181,6 +181,7 @@ dfsm_record_local_result( 

g_mutex_lock (&dfsm->sync_mutex); 
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); 
if (rp) { 
rp->result = msg_result; 
rp->processed = processed; 
@@ -224,6 +225,48 @@ loop: 
return result; 
} 

+static cs_error_t 
+dfsm_send_message_full_debug_state( 
+ dfsm_t *dfsm, 
+ struct iovec *iov, 
+ unsigned int len, 
+ int retry) 
+{ 
+ g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
+ g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 
+ 
+ struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; 
+ cs_error_t result; 
+ int retries = 0; 
+ cfs_dom_message(dfsm->log_domain, "send state message debug"); 
+ cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); 
+ for (int i = 0; i < len; i++) 
+ cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); 
+loop: 
+ cfs_dom_message(dfsm->log_domain, "send state message loop body"); 
+ 
+ result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); 
+ 
+ cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); 
+ if (retry && result == CS_ERR_TRY_AGAIN) { 
+ nanosleep(&tvreq, NULL); 
+ ++retries; 
+ if ((retries % 10) == 0) 
+ cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); 
+ if (retries < 100) 
+ goto loop; 
+ } 
+ 
+ if (retries) 
+ cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); 
+ 
+ if (result != CS_OK && 
+ (!retry || result != CS_ERR_TRY_AGAIN)) 
+ cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); 
+ 
+ return result; 
+} 
+ 
static cs_error_t 
dfsm_send_state_message_full( 
dfsm_t *dfsm, 
@@ -235,6 +278,8 @@ dfsm_send_state_message_full( 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 
+ 
dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
@@ -252,7 +297,7 @@ dfsm_send_state_message_full( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

- return dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 
+ return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 
} 

cs_error_t 
@@ -317,6 +362,7 @@ dfsm_send_message_sync( 
for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, count %" PRIu64 ", msgtype %d, len %d", msgcount, msgtype, len); 
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -335,10 +381,12 @@ dfsm_send_message_sync( 
if (rp) { 
g_mutex_lock (&dfsm->sync_mutex); 

- while (dfsm->msgcount_rcvd < msgcount) 
+ while (dfsm->msgcount_rcvd < msgcount) { 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); 
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); 
+ } 
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); 

- 
g_hash_table_remove(dfsm->results, &rp->msgcount); 

g_mutex_unlock (&dfsm->sync_mutex); 
@@ -685,9 +733,13 @@ dfsm_cpg_deliver_callback( 
return; 
} 

+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); 
+ 
if (base_header->type == DFSM_MESSAGE_NORMAL) { 

dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", 
+ base_header->type, base_header->subtype, msg_len); 

if (msg_len < sizeof(dfsm_message_normal_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", 
@@ -704,6 +756,8 @@ dfsm_cpg_deliver_callback( 
} else { 

int msg_res = -1; 
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", 
+ header->count, base_header->subtype, msg_len); 
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( 
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, 
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), 
@@ -724,6 +778,8 @@ dfsm_cpg_deliver_callback( 
*/ 

dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; 
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", 
+ base_header->type, base_header->subtype, msg_len, mode); 

if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", 
@@ -744,6 +800,7 @@ dfsm_cpg_deliver_callback( 

if (mode == DFSM_MODE_SYNCED) { 
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

for (int i = 0; i < dfsm->sync_info->node_count; i++) 
dfsm->sync_info->nodes[i].synced = 1; 
@@ -754,6 +811,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { 
+ cfs_dom_debug(dfsm->log_domain, "received verify request message"); 

if (msg_len != sizeof(dfsm->csum_counter)) { 
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); 
@@ -823,7 +881,6 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_START_SYNC) { 

if (base_header->type == DFSM_MESSAGE_SYNC_START) { 
- 
if (nodeid != dfsm->lowest_nodeid) { 
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", 
nodeid, pid); 
@@ -861,6 +918,7 @@ dfsm_cpg_deliver_callback( 
return; 

} else if (base_header->type == DFSM_MESSAGE_STATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); 

dfsm_node_info_t *ni; 

@@ -906,6 +964,8 @@ dfsm_cpg_deliver_callback( 
goto leave; 
} 

+ } else { 
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); 
} 

return; 
@@ -914,6 +974,7 @@ dfsm_cpg_deliver_callback( 
} else if (mode == DFSM_MODE_UPDATE) { 

if (base_header->type == DFSM_MESSAGE_UPDATE) { 
+ cfs_dom_debug(dfsm->log_domain, "received update message"); 

int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( 
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); 
@@ -925,6 +986,7 @@ dfsm_cpg_deliver_callback( 

} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { 

+ cfs_dom_debug(dfsm->log_domain, "received update complete message"); 

int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); 

@@ -1104,6 +1166,7 @@ dfsm_cpg_confchg_callback( 
size_t joined_list_entries) 
{ 
cs_error_t result; 
+ cfs_debug("dfsm_cpg_confchg_callback called"); 

dfsm_t *dfsm = NULL; 
result = cpg_context_get(handle, (gpointer *)&dfsm); 
@@ -1190,7 +1253,7 @@ dfsm_cpg_confchg_callback( 

dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC); 
if (lowest_nodeid == dfsm->nodeid) { 
- if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) { 
+ if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) { 
cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message"); 
goto leave; 
} 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-25 16:29                                                                                                                         ` Alexandre DERUMIER
@ 2020-09-28  9:17                                                                                                                           ` Fabian Grünbichler
  2020-09-28  9:35                                                                                                                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-28  9:17 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote:
> here a new hang:
> 
> http://odisoweb1.odiso.net/test4/

okay, so at least we now know something strange inside pmxcfs is going 
on, and not inside corosync - we never reach the part where the broken 
node (again #13 in this hang!) sends out the state message:

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state)
[...]
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full)

this should be followed by output like this (from the -unblock log, 
where the sync went through):

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state)

but this never happens. the relevant code from our patched dfsm.c 
(reordered):

static cs_error_t 
dfsm_send_state_message_full(
	dfsm_t *dfsm,
	uint16_t type,
	struct iovec *iov, 
	unsigned int len) 
{
	g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

// this message is still logged
	cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);

// everything below this point might not have happened anymore

	dfsm_message_state_header_t header;
	header.base.type = type;
	header.base.subtype = 0;
	header.base.protocol_version = dfsm->protocol_version;
	header.base.time = time(NULL);
	header.base.reserved = 0;

	header.epoch = dfsm->sync_epoch;

	struct iovec real_iov[len + 1];

	real_iov[0].iov_base = (char *)&header;
	real_iov[0].iov_len = sizeof(header);

	for (int i = 0; i < len; i++)
		real_iov[i + 1] = iov[i];

	return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
}

static cs_error_t 
dfsm_send_message_full_debug_state(
	dfsm_t *dfsm,
	struct iovec *iov, 
	unsigned int len,
	int retry)
{
	g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

	struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
	cs_error_t result;
	int retries = 0;

// this message is not visible in log
// we don't know how far above this we managed to run
	cfs_dom_message(dfsm->log_domain, "send state message debug");
	cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);
	for (int i = 0; i < len; i++)
		cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);
loop:
	cfs_dom_message(dfsm->log_domain, "send state message loop body");

	result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);

	cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);
	if (retry && result == CS_ERR_TRY_AGAIN) {
		nanosleep(&tvreq, NULL);
		++retries;
		if ((retries % 10) == 0)
			cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
		if (retries < 100)
			goto loop;
	}

	if (retries)
		cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

	if (result != CS_OK &&
	    (!retry || result != CS_ERR_TRY_AGAIN))
		cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

	return result;
}

I don't see much that could go wrong inside dfsm_send_state_message_full 
after the debug log statement (it's just filling out the header and iov 
structs, and calling 'dfsm_send_message_full_debug_state' since type == 
2 == DFSM_MESSAGE_STATE).

inside dfsm_send_message_full_debug_state, before the first log statement we 
only check for dfsm or the message length/content being NULL/0, none of 
which can really happen with that call path. also, in that case we'd 
return CS_ERR_INVALID_PARAM, which would bubble up into the delivery 
callback and cause us to leave CPG, which would again be visible in the 
logs. 

but, just to make sure, could you reproduce the issue once more, and 
then (with debug symbols installed) run

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)

on all nodes at the same time? this should minimize the fallout and show 
us whether the thread that logged the first part of sending the state is 
still around on the node that triggers the hang..

> Something new: where doing coredump or bt-full on pmxcfs on node1,
> this have unlocked /etc/pve on other nodes
> 
> /etc/pve unlocked(with coredump or bt-full): 17:57:40

this looks like you bt-ed corosync this time around? if so, then this is 
expected:

- attach gdb to corosync
- corosync blocks
- other nodes notice corosync is gone on node X
- config change triggered
- sync restarts on all nodes, does not trigger bug this time




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-28  9:17                                                                                                                           ` Fabian Grünbichler
@ 2020-09-28  9:35                                                                                                                             ` Alexandre DERUMIER
  2020-09-28 15:59                                                                                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-28  9:35 UTC (permalink / raw)
  To: Proxmox VE development discussion

>>but, just to make sure, could you reproduce the issue once more, and
>>then (with debug symbols installed) run
>>
>>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)
>>
>>on all nodes at the same time? this should minimize the fallout and show
>>us whether the thread that logged the first part of sending the state is
>>still around on the node that triggers the hang..

ok, no problem, I'll do a new test today



>>this looks like you bt-ed corosync this time around? if so, than this is 
>>expected: 
>>
>>- attach gdb to corosync 
>>- corosync blocks 
>>- other nodes notice corosync is gone on node X 
>>- config change triggered 
>>- sync restarts on all nodes, does not trigger bug this time 

ok, thanks, makes sense. I didn't notice this previously, but this time it was with corosync started.


----- Mail original -----
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Lundi 28 Septembre 2020 11:17:37
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: 
> here a new hang: 
> 
> http://odisoweb1.odiso.net/test4/ 

okay, so at least we now know something strange inside pmxcfs is going 
on, and not inside corosync - we never reach the part where the broken 
node (again #13 in this hang!) sends out the state message: 

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) 
[...] 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) 

this should be followed by output like this (from the -unblock log, 
where the sync went through): 

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) 

but this never happens. the relevant code from our patched dfsm.c 
(reordered): 

static cs_error_t 
dfsm_send_state_message_full( 
dfsm_t *dfsm, 
uint16_t type, 
struct iovec *iov, 
unsigned int len) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

// this message is still logged 
cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 

// everything below this point might not have happened anymore 

dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
header.base.protocol_version = dfsm->protocol_version; 
header.base.time = time(NULL); 
header.base.reserved = 0; 

header.epoch = dfsm->sync_epoch; 

struct iovec real_iov[len + 1]; 

real_iov[0].iov_base = (char *)&header; 
real_iov[0].iov_len = sizeof(header); 

for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 
} 

static cs_error_t 
dfsm_send_message_full_debug_state( 
dfsm_t *dfsm, 
struct iovec *iov, 
unsigned int len, 
int retry) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; 
cs_error_t result; 
int retries = 0; 

// this message is not visible in log 
// we don't know how far above this we managed to run 
cfs_dom_message(dfsm->log_domain, "send state message debug"); 
cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); 
for (int i = 0; i < len; i++) 
cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); 
loop: 
cfs_dom_message(dfsm->log_domain, "send state message loop body"); 

result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); 

cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); 
if (retry && result == CS_ERR_TRY_AGAIN) { 
nanosleep(&tvreq, NULL); 
++retries; 
if ((retries % 10) == 0) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); 
if (retries < 100) 
goto loop; 
} 

if (retries) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); 

if (result != CS_OK && 
(!retry || result != CS_ERR_TRY_AGAIN)) 
cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); 

return result; 
} 

I don't see much that could go wrong inside dfsm_send_state_message_full 
after the debug log statement (it's just filling out the header and iov 
structs, and calling 'dfsm_send_message_full_debug_state' since type == 
2 == DFSM_MESSAGE_STATE). 

inside dfsm_send_message_full_debug, before the first log statement we 
only check for dfsm or the message length/content being NULL/0, all of 
which can't really happen with that call path. also, in that case we'd 
return CS_ERR_INVALID_PARAM , which would bubble up into the delivery 
callback and cause us to leave CPG, which would again be visible in the 
logs.. 

but, just to make sure, could you reproduce the issue once more, and 
then (with debug symbols installed) run 

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) 

on all nodes at the same time? this should minimize the fallout and show 
us whether the thread that logged the first part of sending the state is 
still around on the node that triggers the hang.. 

> Something new: where doing coredump or bt-full on pmxcfs on node1, 
> this have unlocked /etc/pve on other nodes 
> 
> /etc/pve unlocked(with coredump or bt-full): 17:57:40 

this looks like you bt-ed corosync this time around? if so, than this is 
expected: 

- attach gdb to corosync 
- corosync blocks 
- other nodes notice corosync is gone on node X 
- config change triggered 
- sync restarts on all nodes, does not trigger bug this time 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-28  9:35                                                                                                                             ` Alexandre DERUMIER
@ 2020-09-28 15:59                                                                                                                               ` Alexandre DERUMIER
  2020-09-29  5:30                                                                                                                                 ` Alexandre DERUMIER
  2020-09-29  8:51                                                                                                                                 ` Fabian Grünbichler
  0 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-28 15:59 UTC (permalink / raw)
  To: Proxmox VE development discussion

Here a new test http://odisoweb1.odiso.net/test5

This occurred at corosync start


node1:
-----
start corosync : 17:30:19


node2: /etc/pve locked
--------------
Current time : 17:30:24


I have done backtraces of all nodes at the same time with parallel ssh at 17:35:22 

and a coredump of all nodes at the same time with parallel ssh at 17:42:26


(Note that this time, /etc/pve was still locked after backtrace/coredump)




----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Lundi 28 Septembre 2020 11:35:00
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>but, just to make sure, could you reproduce the issue once more, and 
>>then (with debug symbols installed) run 
>> 
>>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) 
>> 
>>on all nodes at the same time? this should minimize the fallout and show 
>>us whether the thread that logged the first part of sending the state is 
>>still around on the node that triggers the hang.. 

ok, no problem, I'll do a new test today 



>>this looks like you bt-ed corosync this time around? if so, than this is 
>>expected: 
>> 
>>- attach gdb to corosync 
>>- corosync blocks 
>>- other nodes notice corosync is gone on node X 
>>- config change triggered 
>>- sync restarts on all nodes, does not trigger bug this time 

ok, thanks, make sense. I didn't notice this previously, but this time it was with corosync started. 


----- Mail original ----- 
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Lundi 28 Septembre 2020 11:17:37 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: 
> here a new hang: 
> 
> http://odisoweb1.odiso.net/test4/ 

okay, so at least we now know something strange inside pmxcfs is going 
on, and not inside corosync - we never reach the part where the broken 
node (again #13 in this hang!) sends out the state message: 

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) 
[...] 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) 

this should be followed by output like this (from the -unblock log, 
where the sync went through): 

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) 

but this never happens. the relevant code from our patched dfsm.c 
(reordered): 

static cs_error_t 
dfsm_send_state_message_full( 
dfsm_t *dfsm, 
uint16_t type, 
struct iovec *iov, 
unsigned int len) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

// this message is still logged 
cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 

// everything below this point might not have happened anymore 

dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
header.base.protocol_version = dfsm->protocol_version; 
header.base.time = time(NULL); 
header.base.reserved = 0; 

header.epoch = dfsm->sync_epoch; 

struct iovec real_iov[len + 1]; 

real_iov[0].iov_base = (char *)&header; 
real_iov[0].iov_len = sizeof(header); 

for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 
} 

static cs_error_t 
dfsm_send_message_full_debug_state( 
dfsm_t *dfsm, 
struct iovec *iov, 
unsigned int len, 
int retry) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; 
cs_error_t result; 
int retries = 0; 

// this message is not visible in log 
// we don't know how far above this we managed to run 
cfs_dom_message(dfsm->log_domain, "send state message debug"); 
cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); 
for (int i = 0; i < len; i++) 
cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); 
loop: 
cfs_dom_message(dfsm->log_domain, "send state message loop body"); 

result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); 

cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); 
if (retry && result == CS_ERR_TRY_AGAIN) { 
nanosleep(&tvreq, NULL); 
++retries; 
if ((retries % 10) == 0) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); 
if (retries < 100) 
goto loop; 
} 

if (retries) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); 

if (result != CS_OK && 
(!retry || result != CS_ERR_TRY_AGAIN)) 
cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); 

return result; 
} 

I don't see much that could go wrong inside dfsm_send_state_message_full 
after the debug log statement (it's just filling out the header and iov 
structs, and calling 'dfsm_send_message_full_debug_state' since type == 
2 == DFSM_MESSAGE_STATE). 

inside dfsm_send_message_full_debug, before the first log statement we 
only check for dfsm or the message length/content being NULL/0, all of 
which can't really happen with that call path. also, in that case we'd 
return CS_ERR_INVALID_PARAM , which would bubble up into the delivery 
callback and cause us to leave CPG, which would again be visible in the 
logs.. 

but, just to make sure, could you reproduce the issue once more, and 
then (with debug symbols installed) run 

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) 

on all nodes at the same time? this should minimize the fallout and show 
us whether the thread that logged the first part of sending the state is 
still around on the node that triggers the hang.. 

> Something new: where doing coredump or bt-full on pmxcfs on node1, 
> this have unlocked /etc/pve on other nodes 
> 
> /etc/pve unlocked(with coredump or bt-full): 17:57:40 

this looks like you bt-ed corosync this time around? if so, than this is 
expected: 

- attach gdb to corosync 
- corosync blocks 
- other nodes notice corosync is gone on node X 
- config change triggered 
- sync restarts on all nodes, does not trigger bug this time 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-28 15:59                                                                                                                               ` Alexandre DERUMIER
@ 2020-09-29  5:30                                                                                                                                 ` Alexandre DERUMIER
  2020-09-29  8:51                                                                                                                                 ` Fabian Grünbichler
  1 sibling, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29  5:30 UTC (permalink / raw)
  To: Proxmox VE development discussion

also for test5,

I have restarted corosync on node1 at 17:54:05, this has unlocked /etc/pve on the other nodes

I have submitted the logs too: "corosync-restart-nodeX.log"

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Lundi 28 Septembre 2020 17:59:20
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Here a new test http://odisoweb1.odiso.net/test5 

This has occured at corosync start 


node1: 
----- 
start corosync : 17:30:19 


node2: /etc/pve locked 
-------------- 
Current time : 17:30:24 


I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 

and a coredump of all nodes at same time with parallel ssh at 17:42:26 


(Note that this time, /etc/pve was still locked after backtrace/coredump) 




----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Lundi 28 Septembre 2020 11:35:00 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>but, just to make sure, could you reproduce the issue once more, and 
>>then (with debug symbols installed) run 
>> 
>>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) 
>> 
>>on all nodes at the same time? this should minimize the fallout and show 
>>us whether the thread that logged the first part of sending the state is 
>>still around on the node that triggers the hang.. 

ok, no problem, I'll do a new test today 



>>this looks like you bt-ed corosync this time around? if so, than this is 
>>expected: 
>> 
>>- attach gdb to corosync 
>>- corosync blocks 
>>- other nodes notice corosync is gone on node X 
>>- config change triggered 
>>- sync restarts on all nodes, does not trigger bug this time 

ok, thanks, make sense. I didn't notice this previously, but this time it was with corosync started. 


----- Mail original ----- 
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Lundi 28 Septembre 2020 11:17:37 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: 
> here a new hang: 
> 
> http://odisoweb1.odiso.net/test4/ 

okay, so at least we now know something strange inside pmxcfs is going 
on, and not inside corosync - we never reach the part where the broken 
node (again #13 in this hang!) sends out the state message: 

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) 
[...] 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) 

this should be followed by output like this (from the -unblock log, 
where the sync went through): 

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) 
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) 

but this never happens. the relevant code from our patched dfsm.c 
(reordered): 

static cs_error_t 
dfsm_send_state_message_full( 
dfsm_t *dfsm, 
uint16_t type, 
struct iovec *iov, 
unsigned int len) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

// this message is still logged 
cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); 

// everything below this point might not have happened anymore 

dfsm_message_state_header_t header; 
header.base.type = type; 
header.base.subtype = 0; 
header.base.protocol_version = dfsm->protocol_version; 
header.base.time = time(NULL); 
header.base.reserved = 0; 

header.epoch = dfsm->sync_epoch; 

struct iovec real_iov[len + 1]; 

real_iov[0].iov_base = (char *)&header; 
real_iov[0].iov_len = sizeof(header); 

for (int i = 0; i < len; i++) 
real_iov[i + 1] = iov[i]; 

return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); 
} 

static cs_error_t 
dfsm_send_message_full_debug_state( 
dfsm_t *dfsm, 
struct iovec *iov, 
unsigned int len, 
int retry) 
{ 
g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); 
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); 

struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; 
cs_error_t result; 
int retries = 0; 

// this message is not visible in log 
// we don't know how far above this we managed to run 
cfs_dom_message(dfsm->log_domain, "send state message debug"); 
cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); 
for (int i = 0; i < len; i++) 
cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); 
loop: 
cfs_dom_message(dfsm->log_domain, "send state message loop body"); 

result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); 

cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); 
if (retry && result == CS_ERR_TRY_AGAIN) { 
nanosleep(&tvreq, NULL); 
++retries; 
if ((retries % 10) == 0) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); 
if (retries < 100) 
goto loop; 
} 

if (retries) 
cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); 

if (result != CS_OK && 
(!retry || result != CS_ERR_TRY_AGAIN)) 
cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); 

return result; 
} 

I don't see much that could go wrong inside dfsm_send_state_message_full 
after the debug log statement (it's just filling out the header and iov 
structs, and calling 'dfsm_send_message_full_debug_state' since type == 
2 == DFSM_MESSAGE_STATE). 

inside dfsm_send_message_full_debug, before the first log statement we 
only check for dfsm or the message length/content being NULL/0, all of 
which can't really happen with that call path. also, in that case we'd 
return CS_ERR_INVALID_PARAM , which would bubble up into the delivery 
callback and cause us to leave CPG, which would again be visible in the 
logs.. 

but, just to make sure, could you reproduce the issue once more, and 
then (with debug symbols installed) run 

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) 

on all nodes at the same time? this should minimize the fallout and show 
us whether the thread that logged the first part of sending the state is 
still around on the node that triggers the hang.. 

> Something new: where doing coredump or bt-full on pmxcfs on node1, 
> this have unlocked /etc/pve on other nodes 
> 
> /etc/pve unlocked(with coredump or bt-full): 17:57:40 

this looks like you bt-ed corosync this time around? if so, than this is 
expected: 

- attach gdb to corosync 
- corosync blocks 
- other nodes notice corosync is gone on node X 
- config change triggered 
- sync restarts on all nodes, does not trigger bug this time 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-28 15:59                                                                                                                               ` Alexandre DERUMIER
  2020-09-29  5:30                                                                                                                                 ` Alexandre DERUMIER
@ 2020-09-29  8:51                                                                                                                                 ` Fabian Grünbichler
  2020-09-29  9:37                                                                                                                                   ` Alexandre DERUMIER
  1 sibling, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-29  8:51 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote:
> Here a new test http://odisoweb1.odiso.net/test5
> 
> This has occured at corosync start
> 
> 
> node1:
> -----
> start corosync : 17:30:19
> 
> 
> node2: /etc/pve locked
> --------------
> Current time : 17:30:24
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump)

okay, so this time two more log lines got printed on the problem-causing 
node #13 (again), but it still stops logging at a point where this makes 
no sense.

I rebuilt the packages:

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa  pve-cluster_6.1-8_amd64.deb
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7  pve-cluster-dbgsym_6.1-8_amd64.deb

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough); 
let's hope this gets us the information we need. please repeat test5 
with these packages.
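As a generic illustration of the suspicion above (this is not pmxcfs code and not the actual change, just a sketch of the failure mode): a bounded, non-blocking log queue silently drops lines once the consumer falls behind, whereas a synchronous write does not.

#include <stdio.h>

#define LOG_SLOTS 4

static char log_queue[LOG_SLOTS][256];
static int log_used;

/* non-blocking enqueue: returns 0 and silently drops the message when the queue is full */
static int log_enqueue(const char *msg)
{
	if (log_used == LOG_SLOTS)
		return 0;
	snprintf(log_queue[log_used++], sizeof(log_queue[0]), "%s", msg);
	return 1;
}

/* synchronous write: slower under high throughput, but nothing gets lost */
static void log_sync(FILE *out, const char *msg)
{
	fprintf(out, "%s\n", msg);
	fflush(out);
}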

is there anything special about node 13? network topology, slower 
hardware, ... ?




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29  8:51                                                                                                                                 ` Fabian Grünbichler
@ 2020-09-29  9:37                                                                                                                                   ` Alexandre DERUMIER
  2020-09-29 10:52                                                                                                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29  9:37 UTC (permalink / raw)
  To: Proxmox VE development discussion

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon

>>is there anything special about node 13? network topology, slower
>>hardware, ... ?

no, nothing special, all nodes have exactly the same hardware/cpu (24 cores/48 threads, 3 GHz)/memory/disk.

this node is at around 10% cpu usage, load is around 5.

----- Mail original -----
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 10:51:32
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> ----- 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -------------- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29  9:37                                                                                                                                   ` Alexandre DERUMIER
@ 2020-09-29 10:52                                                                                                                                     ` Alexandre DERUMIER
  2020-09-29 11:43                                                                                                                                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29 10:52 UTC (permalink / raw)
  To: Proxmox VE development discussion

here a new test:

http://odisoweb1.odiso.net/test6/

node1
-----
start corosync : 12:08:33


node2 (/etc/pve lock)
-----
Current time : 12:08:39


node1 (stop corosync : unlock /etc/pve)
-----
12:28:11 : systemctl stop corosync


backtraces: 12:26:30


coredump : 12:27:21


----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 11:37:41
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

----- Mail original ----- 
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> ----- 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -------------- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29 10:52                                                                                                                                     ` Alexandre DERUMIER
@ 2020-09-29 11:43                                                                                                                                       ` Alexandre DERUMIER
  2020-09-29 11:50                                                                                                                                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29 11:43 UTC (permalink / raw)
  To: Proxmox VE development discussion

>>
>>node1 (stop corosync : unlock /etc/pve)
>>-----
>>12:28:11 : systemctl stop corosync

sorry, this was wrong, I need to start corosync again after the stop to get it working again.
I'll reupload these logs.


----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 12:52:44
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

here a new test: 

http://odisoweb1.odiso.net/test6/ 

node1 
----- 
start corosync : 12:08:33 


node2 (/etc/pve lock) 
----- 
Current time : 12:08:39 


node1 (stop corosync : unlock /etc/pve) 
----- 
12:28:11 : systemctl stop corosync 


backtraces: 12:26:30 


coredump : 12:27:21 


----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 11:37:41 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

----- Mail original ----- 
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> ----- 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -------------- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29 11:43                                                                                                                                       ` Alexandre DERUMIER
@ 2020-09-29 11:50                                                                                                                                         ` Alexandre DERUMIER
  2020-09-29 13:28                                                                                                                                           ` Fabian Grünbichler
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29 11:50 UTC (permalink / raw)
  To: Proxmox VE development discussion

I have reuploaded the logs

node1
-----
start corosync : 12:08:33   (corosync.log)


node2 (/etc/pve lock)
-----
Current time : 12:08:39


node1 (stop corosync : ---> not unlocked)   (corosync-stop.log)
-----
12:28:11 : systemctl stop corosync

node2 (start corosync: ----> /etc/pve unlocked)    (corosync-start.log)
------------------------------------------------

13:41:16 : systemctl start corosync


----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 13:43:08
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>> 
>>node1 (stop corosync : unlock /etc/pve) 
>>----- 
>>12:28:11 : systemctl stop corosync 

sorry, this was wrong,I need to start corosync after the stop to get it working again 
I'll reupload theses logs 


----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 12:52:44 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

here a new test: 

http://odisoweb1.odiso.net/test6/ 

node1 
----- 
start corosync : 12:08:33 


node2 (/etc/pve lock) 
----- 
Current time : 12:08:39 


node1 (stop corosync : unlock /etc/pve) 
----- 
12:28:11 : systemctl stop corosync 


backtraces: 12:26:30 


coredump : 12:27:21 


----- Mail original ----- 
De: "aderumier" <aderumier@odiso.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 11:37:41 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>>with a change of how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough), 
>>let's hope this gets us the information we need. please repeat the test5 
>>again with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. 

this node is around 10% cpu usage, load is around 5. 

----- Mail original ----- 
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Envoyé: Mardi 29 Septembre 2020 10:51:32 
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here a new test http://odisoweb1.odiso.net/test5 
> 
> This has occured at corosync start 
> 
> 
> node1: 
> ----- 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -------------- 
> Current time : 17:30:24 
> 
> 
> I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 
> 
> and a coredump of all nodes at same time with parallel ssh at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 

okay, so this time two more log lines got printed on the (again) problem 
causing node #13, but it still stops logging at a point where this makes 
no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb 

with a change of how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough), 
let's hope this gets us the information we need. please repeat the test5 
again with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29 11:50                                                                                                                                         ` Alexandre DERUMIER
@ 2020-09-29 13:28                                                                                                                                           ` Fabian Grünbichler
  2020-09-29 13:52                                                                                                                                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Fabian Grünbichler @ 2020-09-29 13:28 UTC (permalink / raw)
  To: Proxmox VE development discussion

huge thanks for all the work on this btw!

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!).
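To make the described fix concrete, here is a minimal sketch of the kind of change meant (assumed names and structure, not the actual patch): a process-wide mutex serializing calls into the CPG library, since the same cpg handle is used from more than one pmxcfs thread.

#include <glib.h>
#include <corosync/cpg.h>

/* sketch only (assumed helper, not the actual patch): serialize all calls
 * into libcpg, which is not safe to use concurrently from multiple threads
 * on the same handle without external locking */
static GMutex cpg_lock;

static cs_error_t
locked_cpg_mcast(cpg_handle_t handle, const struct iovec *iov, unsigned int iov_len)
{
	cs_error_t result;

	g_mutex_lock(&cpg_lock);
	result = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, iov_len);
	g_mutex_unlock(&cpg_lock);

	return result;
}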

rebuilt packages with a proof-of-concept-fix:

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7  pve-cluster_6.1-8_amd64.deb
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef  pve-cluster-dbgsym_6.1-8_amd64.deb

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs.

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday.




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29 13:28                                                                                                                                           ` Fabian Grünbichler
@ 2020-09-29 13:52                                                                                                                                             ` Alexandre DERUMIER
  2020-09-30  6:09                                                                                                                                               ` Alexandre DERUMIER
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-29 13:52 UTC (permalink / raw)
  To: Proxmox VE development discussion


>>huge thanks for all the work on this btw! 

huge thanks to you ! ;)


>>I think I've found a likely culprit (a missing lock around a 
>>non-thread-safe corosync library call) based on the last logs (which 
>>were now finally complete!).

YES :)  


>>if feedback from your end is positive, I'll whip up a proper patch 
>>tomorrow or on Thursday. 

I'm going to launch a new test right now !


----- Mail original -----
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 15:28:19
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

huge thanks for all the work on this btw! 

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!). 

rebuilt packages with a proof-of-concept-fix: 

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb 
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb 

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs. 

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday. 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-29 13:52                                                                                                                                             ` Alexandre DERUMIER
@ 2020-09-30  6:09                                                                                                                                               ` Alexandre DERUMIER
  2020-09-30  6:26                                                                                                                                                 ` Thomas Lamprecht
  0 siblings, 1 reply; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-30  6:09 UTC (permalink / raw)
  To: Proxmox VE development discussion

Hi,

some news: my last test has been running for 14h now, and I haven't had any problem :)

So, it seems that it is indeed fixed! Congratulations!



I wonder if it could be related to this forum user:
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/

His problem is that after a corosync lag (he has 1 cluster stretched over 2 DCs with 10km distance, so I think he sometimes gets some small lag),
1 node floods the other nodes with a lot of udp packets (making things worse, as the corosync cpu goes to 100% / overloaded, and then it can't see the other nodes).

I had this problem 6 months ago after shutting down a node, that's why I'm thinking it could "maybe" be related.

So, I wonder if it could be the same pmxcfs bug, with something looping or sending the same packets again and again.

The forum user seems to have the problem multiple times in a few weeks, so maybe he'll be able to test the fixed pmxcfs, and tell us if it fixes this bug too.



----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, September 29, 2020 15:52:18
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>huge thanks for all the work on this btw! 

huge thanks to you ! ;) 


>>I think I've found a likely culprit (a missing lock around a 
>>non-thread-safe corosync library call) based on the last logs (which 
>>were now finally complete!). 

YES :) 


>>if feedback from your end is positive, I'll whip up a proper patch 
>>tomorrow or on Thursday. 

I'm going to launch a new test right now ! 


----- Original Message ----- 
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> 
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Tuesday, September 29, 2020 15:28:19 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

huge thanks for all the work on this btw! 

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!). 

rebuilt packages with a proof-of-concept-fix: 

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb 
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb 

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs. 

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday. 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-30  6:09                                                                                                                                               ` Alexandre DERUMIER
@ 2020-09-30  6:26                                                                                                                                                 ` Thomas Lamprecht
  0 siblings, 0 replies; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-30  6:26 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER

Hi,

On 30.09.20 08:09, Alexandre DERUMIER wrote:
> Some news: my last test has been running for 14h now, and I haven't had any problem :)
> 

great! Thanks for all your testing time, this would have been much harder,
if even possible at all, without you providing so much testing effort on a
production(!) cluster - appreciated!

Naturally many thanks to Fabian too, for reading so many logs without going
insane :-)

> So, it seems that it is indeed fixed! Congratulations!
> 

Honza confirmed Fabian's suspicion about the lacking guarantees of thread safety
for cpg_mcast_joined, which was sadly not documented, so this is surely
a bug - let's hope the last of such hard-to-reproduce ones.
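
For illustration, a minimal sketch (not the actual pve-cluster patch) of the kind of
fix discussed here: every sender goes through a wrapper that takes a process-wide
mutex before calling into libcpg, so cpg_mcast_joined() is never entered from two
threads at once on the shared handle. The wrapper name is made up; callers would
keep their existing CS_ERR_TRY_AGAIN retry handling on top of it.

#include <pthread.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* one process-wide lock serializing access to the CPG handle */
static pthread_mutex_t cpg_lock = PTHREAD_MUTEX_INITIALIZER;

/* used instead of calling cpg_mcast_joined() directly from sender threads */
static cs_error_t
locked_cpg_mcast_joined(cpg_handle_t handle, cpg_guarantee_t guarantee,
                        const struct iovec *iov, unsigned int iov_len)
{
    pthread_mutex_lock(&cpg_lock);
    cs_error_t result = cpg_mcast_joined(handle, guarantee, iov, iov_len);
    pthread_mutex_unlock(&cpg_lock);
    return result;
}

Whether one global lock or one lock per handle is the right granularity depends on
how the handles are shared between threads; the point is only that concurrent calls
on the same handle are avoided.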

> 
> 
> I wonder if it could be related to this forum user
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/
> 
> His problem is that after a corosync lag (he has one cluster stretched across 2 DCs, 10 km apart, so I think he sometimes gets some small lag), 
> one node starts flooding the other nodes with a lot of UDP packets, making things worse, as corosync's CPU goes to 100% / gets overloaded and then can't see the other nodes. 

I can imagine this problem showing up as a side effect of a flood where partition
changes happen. Not so sure that it can be the cause of that directly.

> 
> I had this problem 6 months ago after shutting down a node; that's why I'm thinking it could "maybe" be related. 
> 
> So, I wonder if it could be the same pmxcfs bug, with something looping or sending the same packets again and again. 
> 
> The forum user seems to hit the problem multiple times a week, so maybe he'll be able to test the newly fixed pmxcfs and tell us if it fixes this bug too. 

Testing it once available would surely be a good idea for them.





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
  2020-09-04 12:29 ` Alexandre DERUMIER
  2020-09-04 15:46 ` Alexandre DERUMIER
@ 2020-09-30 15:50 ` Thomas Lamprecht
  2020-10-15  9:16   ` Eneko Lacunza
  2 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-30 15:50 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre DERUMIER

Hi,

FYI: pve-cluster 6.2-1 is now available on pvetest; it includes the slightly modified
patch from Fabian.

cheers,
Thomas




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-30 15:50 ` Thomas Lamprecht
@ 2020-10-15  9:16   ` Eneko Lacunza
  0 siblings, 0 replies; 84+ messages in thread
From: Eneko Lacunza @ 2020-10-15  9:16 UTC (permalink / raw)
  To: pve-devel

Hi all,

I'm just a lurker on this list, but wanted to send a big THANK YOU to 
all involved in this fix, even with this 15-day lag. :-)

I think some of our clients have been affected by this on production 
clusters, so I hope this will improve their cluster stability.

Normally new features are the prettiest things, but having a robust and 
dependable system - that has no price!!

Thanks a lot!!
Eneko


El 30/9/20 a las 17:50, Thomas Lamprecht escribió:
> Hi,
>
> FYI: pve-cluster 6.2-1 is now available on pvetest; it includes the slightly modified
> patch from Fabian.
>
> cheers,
> Thomas
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>


-- 
Eneko Lacunza                | +34 943 569 206
                              | elacunza@binovo.es
Zuzendari teknikoa           | https://www.binovo.es
Director técnico             | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-04 12:29 ` Alexandre DERUMIER
  2020-09-04 15:42   ` Dietmar Maurer
@ 2020-12-29 14:21   ` Josef Johansson
  1 sibling, 0 replies; 84+ messages in thread
From: Josef Johansson @ 2020-12-29 14:21 UTC (permalink / raw)
  To: pve-devel

Hi,

On 9/4/20 2:29 PM, Alexandre DERUMIER wrote:
> BTW,
>
> do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ?
>
> I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..)

Using corosync to enable HA is a bit scary TBH; not sure how people
solve this, but I'd rather have the logic outside the cluster so it can
make a solid decision on whether a node should be rebooted or not, and
choose to reboot via iDRAC if all checks fail. Maybe use the metrics
support to facilitate this?
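
To sketch what such out-of-band logic could look like (purely hypothetical, not an
existing Proxmox tool - node names, BMC addresses and credentials below are
placeholders): a small checker running outside the cluster asks each node for its
quorum state via pvecm and only power-cycles it through the BMC, via ipmitool, when
the node is unreachable or reports being non-quorate.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ask a node over ssh whether pvecm reports "Quorate: Yes" */
static int node_is_quorate(const char *node)
{
    char cmd[256], line[512];
    int quorate = 0;

    snprintf(cmd, sizeof(cmd),
             "ssh -o ConnectTimeout=5 root@%s pvecm status 2>/dev/null", node);

    FILE *p = popen(cmd, "r");
    if (!p)
        return 0;
    while (fgets(line, sizeof(line), p)) {
        if (strstr(line, "Quorate:") && strstr(line, "Yes"))
            quorate = 1;
    }
    pclose(p);
    return quorate;
}

int main(void)
{
    /* placeholder node / BMC (iDRAC) address pairs */
    const char *nodes[][2] = {
        { "m6kvm1", "m6kvm1-idrac" },
        { "m6kvm2", "m6kvm2-idrac" },
    };

    for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
        if (node_is_quorate(nodes[i][0]))
            continue;

        /* a real tool would run several independent checks before fencing */
        fprintf(stderr, "%s not quorate - power-cycling via BMC\n", nodes[i][0]);

        char fence[256];
        snprintf(fence, sizeof(fence),
                 "ipmitool -I lanplus -H %s -U root -P secret power cycle",
                 nodes[i][1]);
        system(fence);
    }
    return 0;
}

The decision to fence is then taken from outside the node's own corosync view, and
the power-cycle goes through the out-of-band BMC rather than through the cluster
stack itself.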

> Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ?
>
>
>
> ----- Original Message -----
> From: "aderumier" <aderumier@odiso.com>
> To: "pve-devel" <pve-devel@pve.proxmox.com>
> Sent: Thursday, September 3, 2020 16:11:56
> Subject: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> Hi, 
>
> I had a problem this morning with corosync, after shutting down cleanly a node from the cluster. 
> This is a 14 nodes cluster, node2 was shutted. (with "halt" command) 
>
> HA was actived, and all nodes have been rebooted. 
> (I have already see this problem some months ago, without HA enabled, and it was like half of the nodes didn't see others nodes, like 2 cluster formations) 
>
> Some users have reported similar problems when adding a new node 
> https://forum.proxmox.com/threads/quorum-lost-when-adding-new-7th-node.75197/ 
>
>
> Here the full logs of the nodes, for each node I'm seeing quorum with 13 nodes. (so it's seem to be ok). 
> I didn't have time to look live on server, as HA have restarted them. 
> So, I don't known if it's a corosync bug or something related to crm,lrm,pmxcs (but I don't see any special logs) 
>
> Only node7 have survived a little bit longer,and I had stop lrm/crm to avoid reboot. 
>
> libknet1: 1.15-pve1 
> corosync: 3.0.3-pve1 
>
>
> Any ideas before submiting bug to corosync mailing ? (I'm seeing new libknet && corosync version on their github, I don't have read the changelog yet) 
>
>
>
>
> node1 
> ----- 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm1 corosync[3678]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm1 corosync[3678]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm1 corosync[3678]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [dcdb] notice: cpg_send_message retried 1 times 
> Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 has no active links 
>
> --> reboot 
>
> node2 : shutdown log 
> ----- 
> Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd received signal 15: Terminated 
> Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302Y9120CGN.ata.state 
> Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302MG120CGN.ata.state 
> Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd is exiting (exit status 0) 
> Sep 3 10:39:05 m6kvm2 nrpe[20095]: Caught SIGTERM - shutting down... 
> Sep 3 10:39:05 m6kvm2 nrpe[20095]: Daemon shutdown 
> Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> starting task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: 
> Sep 3 10:39:07 m6kvm2 pve-guests[38552]: all VMs and CTs stopped 
> Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> end task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: OK 
> Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: received signal TERM 
> Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server closing 
> Sep 3 10:39:07 m6kvm2 spiceproxy[36572]: worker exit 
> Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: worker 36572 finished 
> Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server stopped 
> Sep 3 10:39:07 m6kvm2 pvestatd[3786]: received signal TERM 
> Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server closing 
> Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server stopped 
> Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: received signal TERM 
> Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: got shutdown request with shutdown policy 'conditional' 
> Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: shutdown LRM, stop all services 
> Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: received terminate request (signal) 
> Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: stopping pvefw logger 
> Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: watchdog closed (disabled) 
> Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: server stopped 
> Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: received signal TERM 
> Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: server received shutdown request 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: received signal TERM 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server closing 
> Sep 3 10:39:10 m6kvm2 pveproxy[29731]: worker exit 
> Sep 3 10:39:10 m6kvm2 pveproxy[30873]: worker exit 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 30873 finished 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 31391 finished 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 29731 finished 
> Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server stopped 
> Sep 3 10:39:14 m6kvm2 pve-ha-crm[12847]: server stopped 
>
>
> node3 
> ----- 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm3 corosync[30580]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm3 corosync[30580]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm3 corosync[30580]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: cpg_send_message retried 1 times 
> Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 has no active links 
>
>
> node4 
> ----- 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm4 corosync[4085]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: all data is up to date 
> Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log 
>
> node5 
> ----- 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm5 corosync[41830]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log 
>
>
>
>
>
> node6 
> ----- 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm6 corosync[36694]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 has no active links 
>
>
>
> node7 
> ----- 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
> Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> ---> here the others nodes reboot almost at the same time 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 3 link: 0 is down 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 14 link: 0 is down 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 has no active links 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 has no active links 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 13 link: 0 is down 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 8 link: 0 is down 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 6 link: 0 is down 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 10 link: 0 is down 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 has no active links 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 has no active links 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 has no active links 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 has no active links 
> Sep 3 10:40:28 m6kvm7 corosync[15467]: [TOTEM ] Token has not been received in 4505 ms 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 11 link: 0 is down 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 9 link: 0 is down 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 has no active links 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 has no active links 
> Sep 3 10:40:30 m6kvm7 corosync[15467]: [TOTEM ] A processor failed, forming new configuration. 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 4 link: 0 is down 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 12 link: 0 is down 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 1 link: 0 is down 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 has no active links 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 has no active links 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 has no active links 
> Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] link: host: 5 link: 0 is down 
> Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1) 
> Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 has no active links 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] A new membership (7.82f) was formed. Members left: 1 3 4 5 6 8 9 10 11 12 13 14 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6 8 9 10 11 12 13 14 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 12 received 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] This node is within the non-primary component and will NOT provide any services. 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] Members[1]: 7 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
> Sep 3 10:40:41 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: dfsm_deliver_queue: queue length 51 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: node lost quorum 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: members: 7/15892 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: received write while not quorate - trigger resync 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: leaving CPG group 
> Sep 3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown' 
> Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] notice: start cluster connection 
> Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: cpg_join failed: 14 
> Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: can't initialize service 
> Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 
> Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
> Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! 
> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied 
> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied 
> Sep 3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock 
> Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock 
> Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled) 
> Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum 
> Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] rx: host: 4 link: 0 is up 
> Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) 
> Sep 3 10:43:24 m6kvm7 corosync[15467]: [TOTEM ] A new membership (4.834) was formed. Members joined: 4 
> Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 4/3965, 7/15892 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: members: 4/3965, 7/15892 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
> Sep 3 10:43:24 m6kvm7 corosync[15467]: [QUORUM] Members[2]: 4 7 
> Sep 3 10:43:24 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 4/3965/00000002) 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 4/3965/00000002) 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 7/15892 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 7/15892 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: start sending inode updates 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: sent all (4) updates 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
> Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] rx: host: 3 link: 0 is up 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [TOTEM ] A new membership (3.838) was formed. Members joined: 3 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 3/3613, 4/3965, 7/15892 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: members: 3/3613, 4/3965, 7/15892 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [QUORUM] Members[3]: 3 4 7 
> Sep 3 10:44:05 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 3/3613/00000002) 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 3/3613/00000002) 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 4/3965 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 4/3965, 7/15892 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
> Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: dfsm_deliver_queue: queue length 1 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] rx: host: 1 link: 0 is up 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.83c) was formed. Members joined: 1 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received 
> Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 
> Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 
> Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [QUORUM] Members[4]: 1 3 4 7 
> Sep 3 10:44:31 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3552/00000002) 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3552/00000002) 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 3/3613 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 3/3613, 4/3965, 7/15892 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received all states 
> Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date 
> Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log 
> Sep 3 10:49:18 m6kvm7 pmxcfs[15892]: [status] notice: received log 
>
>
>
>
> node8 
> ----- 
>
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm8 corosync[24361]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log 
>
>
> --> reboot 
>
>
>
> node9 
> ----- 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm9 corosync[22340]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: all data is up to date 
> Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log 
>
>
> --> reboot 
>
>
>
> node10 
> ------ 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm10 corosync[41458]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: all data is up to date 
> Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log 
>
> --> reboot 
>
>
> node11 
> ------ 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm11 corosync[12455]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm11 pmxcfs[12983]: [status] notice: received log 
>
>
>
>
> node12 
> ------ 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm12 corosync[43716]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [dcdb] notice: cpg_send_message retried 1 times 
> Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:20 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
> Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log 
>
>
>
>
>
> node13 
> ------ 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm13 corosync[39182]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times 
> Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date 
> Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 has no active links 
> Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
> Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log 
> --> reboot 
>
> node14 
> ------ 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002) 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 
> Sep 3 10:39:16 m6kvm14 corosync[42413]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states 
> Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date 
> Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] link: host: 2 link: 0 is down 
> Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) 
> Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 has no active links 
>
> --> reboot 
>
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel@lists.proxmox.com 
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 
>

-- 
Kind regards
Josef Johansson






Thread overview: 84+ messages
2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
2020-09-04 12:29 ` Alexandre DERUMIER
2020-09-04 15:42   ` Dietmar Maurer
2020-09-05 13:32     ` Alexandre DERUMIER
2020-09-05 15:23       ` dietmar
2020-09-05 17:30         ` Alexandre DERUMIER
2020-09-06  4:21           ` dietmar
2020-09-06  5:36             ` Alexandre DERUMIER
2020-09-06  6:33               ` Alexandre DERUMIER
2020-09-06  8:43               ` Alexandre DERUMIER
2020-09-06 12:14                 ` dietmar
2020-09-06 12:19                   ` dietmar
2020-09-07  7:00                     ` Thomas Lamprecht
2020-09-07  7:19                   ` Alexandre DERUMIER
2020-09-07  8:18                     ` dietmar
2020-09-07  9:32                       ` Alexandre DERUMIER
2020-09-07 13:23                         ` Alexandre DERUMIER
2020-09-08  4:41                           ` dietmar
2020-09-08  7:11                             ` Alexandre DERUMIER
2020-09-09 20:05                               ` Thomas Lamprecht
2020-09-10  4:58                                 ` Alexandre DERUMIER
2020-09-10  8:21                                   ` Thomas Lamprecht
2020-09-10 11:34                                     ` Alexandre DERUMIER
2020-09-10 18:21                                       ` Thomas Lamprecht
2020-09-14  4:54                                         ` Alexandre DERUMIER
2020-09-14  7:14                                           ` Dietmar Maurer
2020-09-14  8:27                                             ` Alexandre DERUMIER
2020-09-14  8:51                                               ` Thomas Lamprecht
2020-09-14 15:45                                                 ` Alexandre DERUMIER
2020-09-15  5:45                                                   ` dietmar
2020-09-15  6:27                                                     ` Alexandre DERUMIER
2020-09-15  7:13                                                       ` dietmar
2020-09-15  8:42                                                         ` Alexandre DERUMIER
2020-09-15  9:35                                                           ` Alexandre DERUMIER
2020-09-15  9:46                                                             ` Thomas Lamprecht
2020-09-15 10:15                                                               ` Alexandre DERUMIER
2020-09-15 11:04                                                                 ` Alexandre DERUMIER
2020-09-15 12:49                                                                   ` Alexandre DERUMIER
2020-09-15 13:00                                                                     ` Thomas Lamprecht
2020-09-15 14:09                                                                       ` Alexandre DERUMIER
2020-09-15 14:19                                                                         ` Alexandre DERUMIER
2020-09-15 14:32                                                                         ` Thomas Lamprecht
2020-09-15 14:57                                                                           ` Alexandre DERUMIER
2020-09-15 15:58                                                                             ` Alexandre DERUMIER
2020-09-16  7:34                                                                               ` Alexandre DERUMIER
2020-09-16  7:58                                                                                 ` Alexandre DERUMIER
2020-09-16  8:30                                                                                   ` Alexandre DERUMIER
2020-09-16  8:53                                                                                     ` Alexandre DERUMIER
     [not found]                                                                                     ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com>
2020-09-16 13:15                                                                                       ` Alexandre DERUMIER
2020-09-16 14:45                                                                                         ` Thomas Lamprecht
2020-09-16 15:17                                                                                           ` Alexandre DERUMIER
2020-09-17  9:21                                                                                             ` Fabian Grünbichler
2020-09-17  9:59                                                                                               ` Alexandre DERUMIER
2020-09-17 10:02                                                                                                 ` Alexandre DERUMIER
2020-09-17 11:35                                                                                                   ` Thomas Lamprecht
2020-09-20 23:54                                                                                                     ` Alexandre DERUMIER
2020-09-22  5:43                                                                                                       ` Alexandre DERUMIER
2020-09-24 14:02                                                                                                         ` Fabian Grünbichler
2020-09-24 14:29                                                                                                           ` Alexandre DERUMIER
2020-09-24 18:07                                                                                                             ` Alexandre DERUMIER
2020-09-25  6:44                                                                                                               ` Alexandre DERUMIER
2020-09-25  7:15                                                                                                                 ` Alexandre DERUMIER
2020-09-25  9:19                                                                                                                   ` Fabian Grünbichler
2020-09-25  9:46                                                                                                                     ` Alexandre DERUMIER
2020-09-25 12:51                                                                                                                       ` Fabian Grünbichler
2020-09-25 16:29                                                                                                                         ` Alexandre DERUMIER
2020-09-28  9:17                                                                                                                           ` Fabian Grünbichler
2020-09-28  9:35                                                                                                                             ` Alexandre DERUMIER
2020-09-28 15:59                                                                                                                               ` Alexandre DERUMIER
2020-09-29  5:30                                                                                                                                 ` Alexandre DERUMIER
2020-09-29  8:51                                                                                                                                 ` Fabian Grünbichler
2020-09-29  9:37                                                                                                                                   ` Alexandre DERUMIER
2020-09-29 10:52                                                                                                                                     ` Alexandre DERUMIER
2020-09-29 11:43                                                                                                                                       ` Alexandre DERUMIER
2020-09-29 11:50                                                                                                                                         ` Alexandre DERUMIER
2020-09-29 13:28                                                                                                                                           ` Fabian Grünbichler
2020-09-29 13:52                                                                                                                                             ` Alexandre DERUMIER
2020-09-30  6:09                                                                                                                                               ` Alexandre DERUMIER
2020-09-30  6:26                                                                                                                                                 ` Thomas Lamprecht
2020-09-15  7:58                                                       ` Thomas Lamprecht
2020-12-29 14:21   ` Josef Johansson
2020-09-04 15:46 ` Alexandre DERUMIER
2020-09-30 15:50 ` Thomas Lamprecht
2020-10-15  9:16   ` Eneko Lacunza
