* [pve-devel] corosync bug: cluster break after 1 node clean shutdown
From: Alexandre DERUMIER @ 2020-09-03 14:11 UTC
To: pve-devel

Hi,

I had a problem this morning with corosync, after cleanly shutting down one node of the cluster.

This is a 14-node cluster; node2 was shut down (with the "halt" command). HA was active, and all the nodes got rebooted.

(I have already seen this problem some months ago, without HA enabled: roughly half of the nodes stopped seeing the other nodes, as if two separate clusters had formed.)

Some users have reported similar problems when adding a new node:
https://forum.proxmox.com/threads/quorum-lost-when-adding-new-7th-node.75197/

Below are the full logs of the nodes (a small parsing sketch for correlating them follows after node14's log). On each node I see quorum with 13 members, so that part seems to be OK. I didn't have time to look at the servers live, as HA had already restarted them, so I don't know whether it's a corosync bug or something related to crm/lrm/pmxcfs (but I don't see any special log entries). Only node7 survived a little longer, and on it I had stopped lrm/crm to avoid the reboot.

libknet1: 1.15-pve1
corosync: 3.0.3-pve1

Any ideas before I submit a bug to the corosync mailing list? (I see new libknet and corosync releases on their GitHub, but I haven't read the changelogs yet.)

node1
-----
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: starting data syncronisation
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: starting data syncronisation
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: received sync request (epoch 1/3527/00000002)
Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: received sync request (epoch 1/3527/00000002)
Sep 3 10:39:16 m6kvm1 corosync[3678]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2
Sep 3 10:39:16 m6kvm1 corosync[3678]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14
Sep 3 10:39:16 m6kvm1 corosync[3678]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [dcdb] notice: cpg_send_message retried 1 times
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: received all states
Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: all data is up to date
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] link: host: 2 link: 0 is down
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 has no active links
--> reboot

node2 : shutdown log
-----
Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd received signal 15: Terminated
Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302Y9120CGN.ata.state
Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302MG120CGN.ata.state
Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd is exiting (exit status 0)
Sep 3 10:39:05 m6kvm2 nrpe[20095]: Caught SIGTERM - shutting down...
Sep 3 10:39:05 m6kvm2 nrpe[20095]: Daemon shutdown Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> starting task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: Sep 3 10:39:07 m6kvm2 pve-guests[38552]: all VMs and CTs stopped Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> end task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: OK Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: received signal TERM Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server closing Sep 3 10:39:07 m6kvm2 spiceproxy[36572]: worker exit Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: worker 36572 finished Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server stopped Sep 3 10:39:07 m6kvm2 pvestatd[3786]: received signal TERM Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server closing Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server stopped Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: received signal TERM Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: got shutdown request with shutdown policy 'conditional' Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: shutdown LRM, stop all services Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: received terminate request (signal) Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: stopping pvefw logger Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: watchdog closed (disabled) Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: server stopped Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: received signal TERM Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: server received shutdown request Sep 3 10:39:10 m6kvm2 pveproxy[24735]: received signal TERM Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server closing Sep 3 10:39:10 m6kvm2 pveproxy[29731]: worker exit Sep 3 10:39:10 m6kvm2 pveproxy[30873]: worker exit Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 30873 finished Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 31391 finished Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 29731 finished Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server stopped Sep 3 10:39:14 m6kvm2 pve-ha-crm[12847]: server stopped node3 ----- Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm3 corosync[30580]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm3 corosync[30580]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm3 corosync[30580]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: cpg_send_message retried 1 times Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: received all states Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 has no active links node4 ----- Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm4 corosync[4085]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm4 corosync[4085]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm4 corosync[4085]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: received all states Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: all data is up to date Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:20 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log node5 ----- Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm5 corosync[41830]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm5 corosync[41830]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm5 corosync[41830]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: received all states Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: 
received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log node6 ----- Sep 3 10:39:16 m6kvm6 corosync[36694]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm6 corosync[36694]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm6 corosync[36694]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 has no active links node7 ----- Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm7 corosync[15467]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: received all states Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:20 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: 
received log Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log ---> here the others nodes reboot almost at the same time Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 3 link: 0 is down Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 14 link: 0 is down Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 has no active links Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 (passive) best link: 0 (pri: 1) Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 has no active links Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 13 link: 0 is down Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 8 link: 0 is down Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 6 link: 0 is down Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 10 link: 0 is down Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1) Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 has no active links Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1) Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 has no active links Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1) Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 has no active links Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1) Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 has no active links Sep 3 10:40:28 m6kvm7 corosync[15467]: [TOTEM ] Token has not been received in 4505 ms Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 11 link: 0 is down Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 9 link: 0 is down Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1) Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 has no active links Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1) Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 has no active links Sep 3 10:40:30 m6kvm7 corosync[15467]: [TOTEM ] A processor failed, forming new configuration. Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 4 link: 0 is down Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 12 link: 0 is down Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 1 link: 0 is down Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 has no active links Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 (passive) best link: 0 (pri: 1) Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 has no active links Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 has no active links Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] link: host: 5 link: 0 is down Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1) Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 has no active links Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] A new membership (7.82f) was formed. 
Members left: 1 3 4 5 6 8 9 10 11 12 13 14 Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6 8 9 10 11 12 13 14 Sep 3 10:40:41 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 12 received Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] This node is within the non-primary component and will NOT provide any services. Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] Members[1]: 7 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date Sep 3 10:40:41 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: dfsm_deliver_queue: queue length 51 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] 
notice: remove message from non-member 13/39678 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: node lost quorum Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: members: 7/15892 Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: received write while not quorate - trigger resync Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: leaving CPG group Sep 3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown' Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] notice: start cluster connection Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: cpg_join failed: 14 Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: can't initialize service Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied Sep 3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled) Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] rx: host: 4 link: 0 is up Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) Sep 3 10:43:24 m6kvm7 corosync[15467]: [TOTEM ] A new membership (4.834) was formed. 
Members joined: 4 Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 4/3965, 7/15892 Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: members: 4/3965, 7/15892 Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation Sep 3 10:43:24 m6kvm7 corosync[15467]: [QUORUM] Members[2]: 4 7 Sep 3 10:43:24 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 4/3965/00000002) Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 4/3965/00000002) Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 7/15892 Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 7/15892 Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: start sending inode updates Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: sent all (4) updates Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received all states Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] rx: host: 3 link: 0 is up Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) Sep 3 10:44:05 m6kvm7 corosync[15467]: [TOTEM ] A new membership (3.838) was formed. Members joined: 3 Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 3/3613, 4/3965, 7/15892 Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: members: 3/3613, 4/3965, 7/15892 Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation Sep 3 10:44:05 m6kvm7 corosync[15467]: [QUORUM] Members[3]: 3 4 7 Sep 3 10:44:05 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 3/3613/00000002) Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 3/3613/00000002) Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 4/3965 Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 4/3965, 7/15892 Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received all states Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: dfsm_deliver_queue: queue length 1 Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] rx: host: 1 link: 0 is up Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 3 10:44:31 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.83c) was formed. 
Members joined: 1 Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation Sep 3 10:44:31 m6kvm7 corosync[15467]: [QUORUM] Members[4]: 1 3 4 7 Sep 3 10:44:31 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3552/00000002) Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3552/00000002) Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 3/3613 Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 3/3613, 4/3965, 7/15892 Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received all states Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log Sep 3 10:49:18 m6kvm7 pmxcfs[15892]: [status] notice: received log node8 ----- Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm8 corosync[24361]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm8 corosync[24361]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm8 corosync[24361]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: received all states Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log --> reboot node9 ----- Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm9 corosync[22340]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm9 corosync[22340]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm9 corosync[22340]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: received all states Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: all data is up to date Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:20 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: 
received log Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log --> reboot node10 ------ Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm10 corosync[41458]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm10 corosync[41458]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm10 corosync[41458]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: received all states Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: all data is up to date Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log --> reboot node11 ------ Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm11 corosync[12455]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm11 corosync[12455]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm11 corosync[12455]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: received all states Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log Sep 3 10:40:20 m6kvm11 pmxcfs[12983]: [status] notice: received log node12 ------ Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm12 corosync[43716]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm12 corosync[43716]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm12 corosync[43716]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [dcdb] notice: cpg_send_message retried 1 times Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: received all states Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:20 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: 
received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log node13 ------ Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm13 corosync[39182]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm13 corosync[39182]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm13 corosync[39182]: [MAIN ] Completed service synchronization, ready to provide service. 
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log --> reboot node14 ------ Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm14 corosync[42413]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm14 corosync[42413]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 has no active links --> reboot ^ permalink raw reply [flat|nested] 84+ messages in thread
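
[Editorial note, not part of the original report] To correlate syslog excerpts like the ones above across all 14 nodes, it can help to merge them into a single timeline of the membership, quorum and knet link events. The following is a minimal, illustrative Python sketch under those assumptions; the regex, the keyword list and the file names are inferred only from the log format pasted above and are not an existing Proxmox or corosync tool.

#!/usr/bin/env python3
# Illustrative sketch: merge per-node syslog excerpts into one timeline of
# the corosync/pmxcfs events relevant to a membership problem.
# Reads the files given on the command line (or stdin).
import fileinput
import re

# Matches lines such as:
#   "Sep 3 10:39:16 m6kvm1 corosync[3678]: [QUORUM] Members[13]: 1 3 4 ..."
SYSLOG_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+'
    r'(?P<host>\S+)\s+'
    r'(?P<daemon>[\w.-]+)\[\d+\]:\s+'
    r'(?P<msg>.*)$'
)

# Message substrings worth correlating across nodes (assumed selection).
INTERESTING = (
    'A new membership',            # TOTEM membership changes
    'Members[',                    # QUORUM member lists
    'link:',                       # KNET link state changes
    'has no active links',
    'Token has not been received',
    'A processor failed',
    'node lost quorum',            # pmxcfs losing quorum
)

def events(lines):
    """Yield (timestamp, host, daemon, message) for interesting entries."""
    for line in lines:
        m = SYSLOG_RE.match(line.strip())
        if m and any(key in m.group('msg') for key in INTERESTING):
            yield m.group('ts'), m.group('host'), m.group('daemon'), m.group('msg')

def main():
    # Sorting the tuples by (timestamp, host) is good enough for one day of logs.
    for ts, host, daemon, msg in sorted(events(fileinput.input())):
        print(f'{ts}  {host:<8} {daemon:<10} {msg}')

if __name__ == '__main__':
    main()

Fed with the per-node excerpts saved to files (for example node1.log through node14.log, hypothetical names), the merged output makes it easier to see that every node formed the 13-member ring (1.82b) at 10:39:16 and that the knet links toward the surviving node7 only started dropping around 10:40:25, shortly before the other nodes were rebooted.
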
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
@ 2020-09-04 12:29 ` Alexandre DERUMIER
  2020-09-04 15:42   ` Dietmar Maurer
  2020-12-29 14:21   ` Josef Johansson
  2020-09-04 15:46 ` Alexandre DERUMIER
  2020-09-30 15:50 ` Thomas Lamprecht
  2 siblings, 2 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-04 12:29 UTC (permalink / raw)
To: Proxmox VE development discussion; +Cc: pve-devel

BTW, do you think it would be possible to add an extra, optional layer of safety checks, not related to corosync?

I have been afraid of this corosync bug for years and still don't use HA. (Or rather, I tried to enable it two months ago, and it gave me a disaster yesterday.)

Something like an extra heartbeat between the node daemons, so that we also check whether we have quorum based on these heartbeats?
Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 has no active links Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log --> reboot node14 ------ Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002) Sep 3 10:39:16 m6kvm14 corosync[42413]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received Sep 3 10:39:16 m6kvm14 corosync[42413]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 Sep 3 10:39:16 m6kvm14 corosync[42413]: [MAIN ] Completed service synchronization, ready to provide service. Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] link: host: 2 link: 0 is down Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 has no active links --> reboot _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-04 12:29 ` Alexandre DERUMIER @ 2020-09-04 15:42 ` Dietmar Maurer 2020-09-05 13:32 ` Alexandre DERUMIER 2020-12-29 14:21 ` Josef Johansson 1 sibling, 1 reply; 84+ messages in thread From: Dietmar Maurer @ 2020-09-04 15:42 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER; +Cc: pve-devel > do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ? I would try to find the bug instead. > I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..) > > Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? Was this even related to corosync? What exactly caused the reboot? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-04 15:42 ` Dietmar Maurer @ 2020-09-05 13:32 ` Alexandre DERUMIER 2020-09-05 15:23 ` dietmar 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-05 13:32 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion, pve-devel > Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? > >>Was this even related to corosync? What exactly caused the reboot? Hi Dietmar, what I'm 100% sure of is that the watchdog has rebooted all the servers (I have watchdog traces in IPMI). That happened just after shutting down the server. What is strange is that the corosync logs on all servers show that they correctly see the node go down, and still see the other nodes. So, I really don't know. Maybe corosync was hanging? I don't have any other logs from crm/lrm/pmxcfs... I'm really blind. :/ ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com> Cc: "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Vendredi 4 Septembre 2020 17:42:45 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ? I would try to find the bug instead. > I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..) > > Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? Was this even related to corosync? What exactly caused the reboot? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-05 13:32 ` Alexandre DERUMIER @ 2020-09-05 15:23 ` dietmar 2020-09-05 17:30 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-05 15:23 UTC (permalink / raw) To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion, pve-devel > what I'm 100% sure, it that the watchdog have reboot all the servers. (I have watchdog trace in ipmi) So you are using ipmi hardware watchdog? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-05 15:23 ` dietmar @ 2020-09-05 17:30 ` Alexandre DERUMIER 2020-09-06 4:21 ` dietmar 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-05 17:30 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion, pve-devel >>So you are using ipmi hardware watchdog? yes, I'm using dell idrac ipmi card watchdog. ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Samedi 5 Septembre 2020 17:23:28 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > what I'm 100% sure, it that the watchdog have reboot all the servers. (I have watchdog trace in ipmi) So you are using ipmi hardware watchdog? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-05 17:30 ` Alexandre DERUMIER @ 2020-09-06 4:21 ` dietmar 2020-09-06 5:36 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-06 4:21 UTC (permalink / raw) To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion, pve-devel > >>So you are using ipmi hardware watchdog? > > yes, I'm using dell idrac ipmi card watchdog But the pve logs look ok, and there is no indication that we stopped updating the watchdog. So why did the watchdog trigger? Maybe an IPMI bug? ^ permalink raw reply [flat|nested] 84+ messages in thread
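One way to approach the question of why the watchdog fired at all: the standard Linux watchdog API can often report the cause of the last reset. The following is a minimal sketch using only the generic linux/watchdog.h ioctls, nothing PVE- or iDRAC-specific; on a node where watchdog-mux already holds /dev/watchdog the open() will fail with EBUSY, and the example must do the magic close itself or it would arm a reset of its own.

/* wd-bootstatus.c - ask the watchdog driver why the machine last reset.
 * Illustration only: opening /dev/watchdog arms it, so we send the magic
 * close character 'V' before closing to avoid triggering a reboot ourselves.
 * Fails with EBUSY on a node where watchdog-mux already owns the device. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) { perror("open /dev/watchdog"); return 1; }

    struct watchdog_info info;
    if (ioctl(fd, WDIOC_GETSUPPORT, &info) == 0)
        printf("driver: %s (options 0x%x)\n", (const char *)info.identity, info.options);

    int bootstatus = 0;
    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &bootstatus) == 0) {
        printf("boot status: 0x%x%s%s%s\n", bootstatus,
               (bootstatus & WDIOF_CARDRESET)  ? " CARDRESET"  : "",
               (bootstatus & WDIOF_OVERHEAT)   ? " OVERHEAT"   : "",
               (bootstatus & WDIOF_POWERUNDER) ? " POWERUNDER" : "");
    }

    /* magic close: tell the driver this is an orderly release, not a crash */
    if (write(fd, "V", 1) != 1)
        perror("magic close");
    close(fd);
    return 0;
}

Not every driver fills in the boot status, but when it does it distinguishes a card-initiated reset from power or thermal events.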
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 4:21 ` dietmar @ 2020-09-06 5:36 ` Alexandre DERUMIER 2020-09-06 6:33 ` Alexandre DERUMIER 2020-09-06 8:43 ` Alexandre DERUMIER 0 siblings, 2 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-06 5:36 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion, pve-devel >>But the pve logs look ok, and there is no indication >>that we stopped updating the watchdog. So why did the >>watchdog trigger? Maybe an IPMI bug? do you mean an IPMI bug on all 13 servers at the same time? (I also have 2 Supermicro servers in this cluster, but they use the same IPMI watchdog driver (ipmi_watchdog).) I had the same kind of bug once (when stopping a server), on another cluster, 6 months ago. That was without HA, but with a different version of corosync, and that time I was really seeing a quorum split in the corosync logs of the servers. I'll try to reproduce with a virtual cluster with 14 nodes (I don't have enough hardware). Could it be a bug in the Proxmox HA code, where the watchdog is not reset by the LRM anymore? ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 06:21:55 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > >>So you are using ipmi hardware watchdog? > > yes, I'm using dell idrac ipmi card watchdog But the pve logs look ok, and there is no indication that we stopped updating the watchdog. So why did the watchdog trigger? Maybe an IPMI bug? ^ permalink raw reply [flat|nested] 84+ messages in thread
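For reference, this is what "resetting the watchdog" boils down to at the lowest level. A minimal sketch of the /dev/watchdog keep-alive pattern follows, assuming the 10 s driver timeout mentioned in the thread; on Proxmox VE the device is actually owned by watchdog-mux, and pve-ha-lrm/pve-ha-crm feed the mux over a unix socket instead, so this shows only the underlying mechanism, not PVE code.

/* Minimal /dev/watchdog keep-alive loop: as long as the ioctl arrives
 * within the driver timeout, nothing happens; if this process stalls or
 * dies without the magic close, the machine resets. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) { perror("open"); return 1; }

    int timeout = 10;                       /* 10 s, the figure used in this thread */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* driver may round or ignore the value */

    for (;;) {
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) == -1)
            perror("keep-alive");
        sleep(1);   /* if this loop ever stalls for more than the timeout, the node resets */
    }
    /* not reached; an orderly shutdown would write 'V' and close() the fd */
}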
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 5:36 ` Alexandre DERUMIER @ 2020-09-06 6:33 ` Alexandre DERUMIER 2020-09-06 8:43 ` Alexandre DERUMIER 1 sibling, 0 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-06 6:33 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: dietmar, pve-devel Also, I wonder if it could be possible to not use watchdog fencing at all (as option), if cluster use only shared storages with native disk lock/reservation. Like ceph rbd for example, with exclusive-lock, you can't write from 2 clients on same rbd, so ha will not be able to start qemu on another node. ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "dietmar" <dietmar@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 07:36:10 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>But the pve logs look ok, and there is no indication >>that we stopped updating the watchdog. So why did the >>watchdog trigger? Maybe an IPMI bug? do you mean an ipmi bug on all 13 servers at the same time ? (I also have 2 supermicro servers in this cluster, but they use same ipmi watchdog driver. (ipmi_watchdog) I had same kind of with bug once (when stopping a server), on another cluster, 6 months ago. This was without HA, but different version of corosync, and that time, I was really seeing quorum split in the corosync logs of the servers. I'll try to reproduce with a virtual cluster with 14 nodes (don't have enough hardware) Could I be a bug in proxmox HA code, where watchdog is not resetted by LRM anymore? ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 06:21:55 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > >>So you are using ipmi hardware watchdog? > > yes, I'm using dell idrac ipmi card watchdog But the pve logs look ok, and there is no indication that we stopped updating the watchdog. So why did the watchdog trigger? Maybe an IPMI bug? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 5:36 ` Alexandre DERUMIER 2020-09-06 6:33 ` Alexandre DERUMIER @ 2020-09-06 8:43 ` Alexandre DERUMIER 2020-09-06 12:14 ` dietmar 1 sibling, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-06 8:43 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: dietmar, pve-devel Maybe something interesting, the only survived node was node7, and it was the crm master I'm also seein crm disabling watchdog, and also some "loop take too long" messages (some migration logs from node2 to node1 before the maintenance) Sep 3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:992': state changed from 'migrate' to 'started' (node = m6kvm1) Sep 3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:993': state changed from 'migrate' to 'started' (node = m6kvm1) Sep 3 10:36:29 m6kvm7 pve-ha-crm[16196]: service 'vm:997': state changed from 'migrate' to 'started' (node = m6kvm1) .... Sep 3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown' Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied Sep 3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled) Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum others nodes timing -------------------- 10:39:16 -> node2 shutdown, leave coroync 10:40:25 -> other nodes rebooted by watchdog ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "dietmar" <dietmar@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 07:36:10 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>But the pve logs look ok, and there is no indication >>that we stopped updating the watchdog. So why did the >>watchdog trigger? Maybe an IPMI bug? do you mean an ipmi bug on all 13 servers at the same time ? (I also have 2 supermicro servers in this cluster, but they use same ipmi watchdog driver. (ipmi_watchdog) I had same kind of with bug once (when stopping a server), on another cluster, 6 months ago. This was without HA, but different version of corosync, and that time, I was really seeing quorum split in the corosync logs of the servers. I'll try to reproduce with a virtual cluster with 14 nodes (don't have enough hardware) Could I be a bug in proxmox HA code, where watchdog is not resetted by LRM anymore? ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 06:21:55 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > >>So you are using ipmi hardware watchdog? 
> > yes, I'm using dell idrac ipmi card watchdog But the pve logs look ok, and there is no indication that we stopped updating the watchdog. So why did the watchdog trigger? Maybe an IPMI bug? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 8:43 ` Alexandre DERUMIER @ 2020-09-06 12:14 ` dietmar 2020-09-06 12:19 ` dietmar 2020-09-07 7:19 ` Alexandre DERUMIER 0 siblings, 2 replies; 84+ messages in thread From: dietmar @ 2020-09-06 12:14 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion; +Cc: pve-devel > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) Indeed, this should not happen. Do you use a separate network for corosync? Or was there high traffic on the network? What kind of maintenance was the reason for the shutdown? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 12:14 ` dietmar @ 2020-09-06 12:19 ` dietmar 2020-09-07 7:00 ` Thomas Lamprecht 2020-09-07 7:19 ` Alexandre DERUMIER 1 sibling, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-06 12:19 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion; +Cc: pve-devel > On 09/06/2020 2:14 PM dietmar <dietmar@proxmox.com> wrote: > > > > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) > > Indeed, this should not happen. Do you use a spearate network for corosync? Or > was there high traffic on the network? What kind of maintenance was the reason > for the shutdown? Do you use the default corosync timeout values, or do you have a special setup? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 12:19 ` dietmar @ 2020-09-07 7:00 ` Thomas Lamprecht 0 siblings, 0 replies; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-07 7:00 UTC (permalink / raw) To: Proxmox VE development discussion, dietmar, Alexandre DERUMIER On 06.09.20 14:19, dietmar wrote: >> On 09/06/2020 2:14 PM dietmar <dietmar@proxmox.com> wrote: >> >> >>> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) >>> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) >> >> Indeed, this should not happen. Do you use a spearate network for corosync? Or >> was there high traffic on the network? What kind of maintenance was the reason >> for the shutdown? > > Do you use the default corosync timeout values, or do you have a special setup? > Can you please post the full corosync config? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-06 12:14 ` dietmar 2020-09-06 12:19 ` dietmar @ 2020-09-07 7:19 ` Alexandre DERUMIER 2020-09-07 8:18 ` dietmar 1 sibling, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-07 7:19 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion, pve-devel >>Indeed, this should not happen. Do you use a spearate network for corosync? No, I use 2x40GB lacp link. >>was there high traffic on the network? but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. >>What kind of maintenance was the reason for the shutdown? ram upgrade. (the server was running ok before shutdown, no hardware problem) (I just shutdown the server, and don't have started it yet when problem occur) >>Do you use the default corosync timeout values, or do you have a special setup? no special tuning, default values. (I don't have any retransmit since months in the logs) >>Can you please post the full corosync config? (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) here the config: " logging { debug: off to_syslog: yes } nodelist { node { name: m6kvm1 nodeid: 1 quorum_votes: 1 ring0_addr: m6kvm1 } node { name: m6kvm10 nodeid: 10 quorum_votes: 1 ring0_addr: m6kvm10 } node { name: m6kvm11 nodeid: 11 quorum_votes: 1 ring0_addr: m6kvm11 } node { name: m6kvm12 nodeid: 12 quorum_votes: 1 ring0_addr: m6kvm12 } node { name: m6kvm13 nodeid: 13 quorum_votes: 1 ring0_addr: m6kvm13 } node { name: m6kvm14 nodeid: 14 quorum_votes: 1 ring0_addr: m6kvm14 } node { name: m6kvm2 nodeid: 2 quorum_votes: 1 ring0_addr: m6kvm2 } node { name: m6kvm3 nodeid: 3 quorum_votes: 1 ring0_addr: m6kvm3 } node { name: m6kvm4 nodeid: 4 quorum_votes: 1 ring0_addr: m6kvm4 } node { name: m6kvm5 nodeid: 5 quorum_votes: 1 ring0_addr: m6kvm5 } node { name: m6kvm6 nodeid: 6 quorum_votes: 1 ring0_addr: m6kvm6 } node { name: m6kvm7 nodeid: 7 quorum_votes: 1 ring0_addr: m6kvm7 } node { name: m6kvm8 nodeid: 8 quorum_votes: 1 ring0_addr: m6kvm8 } node { name: m6kvm9 nodeid: 9 quorum_votes: 1 ring0_addr: m6kvm9 } } quorum { provider: corosync_votequorum } totem { cluster_name: m6kvm config_version: 19 interface { bindnetaddr: 10.3.94.89 ringnumber: 0 } ip_version: ipv4 secauth: on transport: knet version: 2 } ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "pve-devel" <pve-devel@pve.proxmox.com> Envoyé: Dimanche 6 Septembre 2020 14:14:06 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) Indeed, this should not happen. Do you use a spearate network for corosync? Or was there high traffic on the network? What kind of maintenance was the reason for the shutdown? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-07 7:19 ` Alexandre DERUMIER @ 2020-09-07 8:18 ` dietmar 2020-09-07 9:32 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-07 8:18 UTC (permalink / raw) To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion There is a similar report in the forum: https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 No HA involved... > On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote: > > > >>Indeed, this should not happen. Do you use a spearate network for corosync? > > No, I use 2x40GB lacp link. > > >>was there high traffic on the network? > > but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) > > > The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) > > > From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, > so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. > > > I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. > > > > >>What kind of maintenance was the reason for the shutdown? > > ram upgrade. (the server was running ok before shutdown, no hardware problem) > (I just shutdown the server, and don't have started it yet when problem occur) > > > > >>Do you use the default corosync timeout values, or do you have a special setup? > > > no special tuning, default values. (I don't have any retransmit since months in the logs) > > >>Can you please post the full corosync config? > > (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) > > > here the config: > > " > logging { > debug: off > to_syslog: yes > } > > nodelist { > node { > name: m6kvm1 > nodeid: 1 > quorum_votes: 1 > ring0_addr: m6kvm1 > } > node { > name: m6kvm10 > nodeid: 10 > quorum_votes: 1 > ring0_addr: m6kvm10 > } > node { > name: m6kvm11 > nodeid: 11 > quorum_votes: 1 > ring0_addr: m6kvm11 > } > node { > name: m6kvm12 > nodeid: 12 > quorum_votes: 1 > ring0_addr: m6kvm12 > } > node { > name: m6kvm13 > nodeid: 13 > quorum_votes: 1 > ring0_addr: m6kvm13 > } > node { > name: m6kvm14 > nodeid: 14 > quorum_votes: 1 > ring0_addr: m6kvm14 > } > node { > name: m6kvm2 > nodeid: 2 > quorum_votes: 1 > ring0_addr: m6kvm2 > } > node { > name: m6kvm3 > nodeid: 3 > quorum_votes: 1 > ring0_addr: m6kvm3 > } > node { > name: m6kvm4 > nodeid: 4 > quorum_votes: 1 > ring0_addr: m6kvm4 > } > node { > name: m6kvm5 > nodeid: 5 > quorum_votes: 1 > ring0_addr: m6kvm5 > } > node { > name: m6kvm6 > nodeid: 6 > quorum_votes: 1 > ring0_addr: m6kvm6 > } > node { > name: m6kvm7 > nodeid: 7 > quorum_votes: 1 > ring0_addr: m6kvm7 > } > > node { > name: m6kvm8 > nodeid: 8 > quorum_votes: 1 > ring0_addr: m6kvm8 > } > node { > name: m6kvm9 > nodeid: 9 > quorum_votes: 1 > ring0_addr: m6kvm9 > } > } > > quorum { > provider: corosync_votequorum > } > > totem { > cluster_name: m6kvm > config_version: 19 > interface { > bindnetaddr: 10.3.94.89 > ringnumber: 0 > } > ip_version: ipv4 > secauth: on > transport: knet > version: 2 > } > > > > ----- Mail original ----- > De: "dietmar" <dietmar@proxmox.com> > À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> > Cc: "pve-devel" <pve-devel@pve.proxmox.com> > Envoyé: Dimanche 6 Septembre 2020 14:14:06 > Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) > > Indeed, this should not happen. Do you use a spearate network for corosync? Or > was there high traffic on the network? What kind of maintenance was the reason > for the shutdown? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-07 8:18 ` dietmar @ 2020-09-07 9:32 ` Alexandre DERUMIER 2020-09-07 13:23 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-07 9:32 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion >>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 >> >>No HA involved... I had already help this user some week ago https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093 HA was actived at this time. (Maybe the watchdog was still running, I'm not sure if you disable HA from all vms if LRM disable the watchdog ?) ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 7 Septembre 2020 10:18:42 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown There is a similar report in the forum: https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 No HA involved... > On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote: > > > >>Indeed, this should not happen. Do you use a spearate network for corosync? > > No, I use 2x40GB lacp link. > > >>was there high traffic on the network? > > but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) > > > The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) > > > From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, > so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. > > > I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. > > > > >>What kind of maintenance was the reason for the shutdown? > > ram upgrade. (the server was running ok before shutdown, no hardware problem) > (I just shutdown the server, and don't have started it yet when problem occur) > > > > >>Do you use the default corosync timeout values, or do you have a special setup? > > > no special tuning, default values. (I don't have any retransmit since months in the logs) > > >>Can you please post the full corosync config? 
> > (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) > > > here the config: > > " > logging { > debug: off > to_syslog: yes > } > > nodelist { > node { > name: m6kvm1 > nodeid: 1 > quorum_votes: 1 > ring0_addr: m6kvm1 > } > node { > name: m6kvm10 > nodeid: 10 > quorum_votes: 1 > ring0_addr: m6kvm10 > } > node { > name: m6kvm11 > nodeid: 11 > quorum_votes: 1 > ring0_addr: m6kvm11 > } > node { > name: m6kvm12 > nodeid: 12 > quorum_votes: 1 > ring0_addr: m6kvm12 > } > node { > name: m6kvm13 > nodeid: 13 > quorum_votes: 1 > ring0_addr: m6kvm13 > } > node { > name: m6kvm14 > nodeid: 14 > quorum_votes: 1 > ring0_addr: m6kvm14 > } > node { > name: m6kvm2 > nodeid: 2 > quorum_votes: 1 > ring0_addr: m6kvm2 > } > node { > name: m6kvm3 > nodeid: 3 > quorum_votes: 1 > ring0_addr: m6kvm3 > } > node { > name: m6kvm4 > nodeid: 4 > quorum_votes: 1 > ring0_addr: m6kvm4 > } > node { > name: m6kvm5 > nodeid: 5 > quorum_votes: 1 > ring0_addr: m6kvm5 > } > node { > name: m6kvm6 > nodeid: 6 > quorum_votes: 1 > ring0_addr: m6kvm6 > } > node { > name: m6kvm7 > nodeid: 7 > quorum_votes: 1 > ring0_addr: m6kvm7 > } > > node { > name: m6kvm8 > nodeid: 8 > quorum_votes: 1 > ring0_addr: m6kvm8 > } > node { > name: m6kvm9 > nodeid: 9 > quorum_votes: 1 > ring0_addr: m6kvm9 > } > } > > quorum { > provider: corosync_votequorum > } > > totem { > cluster_name: m6kvm > config_version: 19 > interface { > bindnetaddr: 10.3.94.89 > ringnumber: 0 > } > ip_version: ipv4 > secauth: on > transport: knet > version: 2 > } > > > > ----- Mail original ----- > De: "dietmar" <dietmar@proxmox.com> > À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> > Cc: "pve-devel" <pve-devel@pve.proxmox.com> > Envoyé: Dimanche 6 Septembre 2020 14:14:06 > Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > > > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) > > Indeed, this should not happen. Do you use a spearate network for corosync? Or > was there high traffic on the network? What kind of maintenance was the reason > for the shutdown? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-07 9:32 ` Alexandre DERUMIER @ 2020-09-07 13:23 ` Alexandre DERUMIER 2020-09-08 4:41 ` dietmar 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-07 13:23 UTC (permalink / raw) To: Proxmox VE development discussion Looking at theses logs: Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied in PVE/HA/Env/PVE2.pm " my $ctime = time(); my $last_lock_time = $last->{lock_time} // 0; my $last_got_lock = $last->{got_lock}; my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs eval { mkdir $lockdir; # pve cluster filesystem not online die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir; if (($ctime - $last_lock_time) < $retry_timeout) { # try cfs lock update request (utime) if (utime(0, $ctime, $filename)) { $got_lock = 1; return; } die "cfs lock update failed - $!\n"; } " If the retry_timeout is = 120, could it explain why I don't have log on others node, if the watchdog trigger after 60s ? I don't known too much how locks are working in pmxcfs, but when a corosync member leave or join, and a new cluster memership is formed, could we have some lock lost or hang ? ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "dietmar" <dietmar@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 7 Septembre 2020 11:32:13 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 >> >>No HA involved... I had already help this user some week ago https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093 HA was actived at this time. (Maybe the watchdog was still running, I'm not sure if you disable HA from all vms if LRM disable the watchdog ?) ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 7 Septembre 2020 10:18:42 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown There is a similar report in the forum: https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111 No HA involved... > On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier@odiso.com> wrote: > > > >>Indeed, this should not happen. Do you use a spearate network for corosync? > > No, I use 2x40GB lacp link. > > >>was there high traffic on the network? > > but I'm far from saturated them. (in pps or througput), (I'm around 3-4gbps) > > > The cluster is 14 nodes, with around 1000vms (with ha enabled on all vms) > > > From my understanding, watchdog-mux was still runing as the watchdog have reset only after 1min and not 10s, > so it's like the lrm was blocked and not sending watchdog timer reset to watchdog-mux. > > > I'll do tests with softdog + soft_noboot=1, so if that happen again,I'll able to debug. > > > > >>What kind of maintenance was the reason for the shutdown? > > ram upgrade. 
(the server was running ok before shutdown, no hardware problem) > (I just shutdown the server, and don't have started it yet when problem occur) > > > > >>Do you use the default corosync timeout values, or do you have a special setup? > > > no special tuning, default values. (I don't have any retransmit since months in the logs) > > >>Can you please post the full corosync config? > > (I have verified, the running version was corosync was 3.0.3 with libknet 1.15) > > > here the config: > > " > logging { > debug: off > to_syslog: yes > } > > nodelist { > node { > name: m6kvm1 > nodeid: 1 > quorum_votes: 1 > ring0_addr: m6kvm1 > } > node { > name: m6kvm10 > nodeid: 10 > quorum_votes: 1 > ring0_addr: m6kvm10 > } > node { > name: m6kvm11 > nodeid: 11 > quorum_votes: 1 > ring0_addr: m6kvm11 > } > node { > name: m6kvm12 > nodeid: 12 > quorum_votes: 1 > ring0_addr: m6kvm12 > } > node { > name: m6kvm13 > nodeid: 13 > quorum_votes: 1 > ring0_addr: m6kvm13 > } > node { > name: m6kvm14 > nodeid: 14 > quorum_votes: 1 > ring0_addr: m6kvm14 > } > node { > name: m6kvm2 > nodeid: 2 > quorum_votes: 1 > ring0_addr: m6kvm2 > } > node { > name: m6kvm3 > nodeid: 3 > quorum_votes: 1 > ring0_addr: m6kvm3 > } > node { > name: m6kvm4 > nodeid: 4 > quorum_votes: 1 > ring0_addr: m6kvm4 > } > node { > name: m6kvm5 > nodeid: 5 > quorum_votes: 1 > ring0_addr: m6kvm5 > } > node { > name: m6kvm6 > nodeid: 6 > quorum_votes: 1 > ring0_addr: m6kvm6 > } > node { > name: m6kvm7 > nodeid: 7 > quorum_votes: 1 > ring0_addr: m6kvm7 > } > > node { > name: m6kvm8 > nodeid: 8 > quorum_votes: 1 > ring0_addr: m6kvm8 > } > node { > name: m6kvm9 > nodeid: 9 > quorum_votes: 1 > ring0_addr: m6kvm9 > } > } > > quorum { > provider: corosync_votequorum > } > > totem { > cluster_name: m6kvm > config_version: 19 > interface { > bindnetaddr: 10.3.94.89 > ringnumber: 0 > } > ip_version: ipv4 > secauth: on > transport: knet > version: 2 > } > > > > ----- Mail original ----- > De: "dietmar" <dietmar@proxmox.com> > À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> > Cc: "pve-devel" <pve-devel@pve.proxmox.com> > Envoyé: Dimanche 6 Septembre 2020 14:14:06 > Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > > > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) > > Indeed, this should not happen. Do you use a spearate network for corosync? Or > was there high traffic on the network? What kind of maintenance was the reason > for the shutdown? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
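A worked timeline for the retry_timeout question above, under the assumption (taken from this thread, not from reading the actual sources) that the watchdog-mux per-client timeout is about 60 s and the hardware watchdog about 10 s. With those figures a stuck LRM loop leads to a reset roughly 60-70 s after its last update, well before the 120 s cfs lock lifetime is ever reached, which would explain why the rebooted nodes logged nothing about lost locks. It also matches the observed timing (node2 left at 10:39:16, the other nodes reset between 10:40:25 and 10:40:34).

/* Tiny C program spelling out the assumed timeline; all constants are
 * assumptions grounded in the thread, except the 120 s lock lifetime,
 * which is the $retry_timeout from PVE/HA/Env/PVE2.pm quoted above. */
#include <stdio.h>

int main(void)
{
    const int mux_client_timeout = 60;   /* assumption: "about 1 min" per the thread      */
    const int hw_watchdog_timeout = 10;  /* assumption: hardware watchdog timeout          */
    const int cfs_lock_lifetime = 120;   /* $retry_timeout, hardcoded pmxcfs lock lifetime */

    printf("t=0s    LRM loop freezes; its last update to watchdog-mux was just sent\n");
    printf("t=%ds  watchdog-mux considers the client expired, stops feeding /dev/watchdog\n",
           mux_client_timeout);
    printf("t=%ds  hardware watchdog fires, node resets\n",
           mux_client_timeout + hw_watchdog_timeout);
    printf("t=%ds cfs lock would only expire here - never reached, so no 'lost lock' log\n",
           cfs_lock_lifetime);
    return 0;
}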
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-07 13:23 ` Alexandre DERUMIER @ 2020-09-08 4:41 ` dietmar 2020-09-08 7:11 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-08 4:41 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion > I don't known too much how locks are working in pmxcfs, but when a corosync member leave or join, and a new cluster memership is formed, could we have some lock lost or hang ? It would really help if we could reproduce the bug somehow. Do you have an idea how to trigger the bug? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-08 4:41 ` dietmar @ 2020-09-08 7:11 ` Alexandre DERUMIER 2020-09-09 20:05 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-08 7:11 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion >>It would really help if we could reproduce the bug somehow. Do you have an idea how >>to trigger the bug? I really don't know. I'm currently trying to reproduce on the same cluster, with softdog && soft_noboot=1, and rebooting a node. Maybe it's related to the number of vms, or the number of nodes; I don't have any clue ... ----- Mail original ----- De: "dietmar" <dietmar@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 8 Septembre 2020 06:41:10 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > I don't known too much how locks are working in pmxcfs, but when a corosync member leave or join, and a new cluster memership is formed, could we have some lock lost or hang ? It would really help if we could reproduce the bug somehow. Do you have an idea how to trigger the bug? ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-08 7:11 ` Alexandre DERUMIER @ 2020-09-09 20:05 ` Thomas Lamprecht 2020-09-10 4:58 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-09 20:05 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER, dietmar On 08.09.20 09:11, Alexandre DERUMIER wrote: >>> It would really help if we can reproduce the bug somehow. Do you have and idea how >>> to trigger the bug? > > I really don't known. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting node. > > > Maybe it's related with the number of vms, or the number of nodes, don't have any clue ... I checked a bit the watchdog code, our user-space mux one and the kernel drivers, and just noting a few things here (thinking out aloud): The /dev/watchdog itself is always active, else we could loose it to some other program and not be able to activate HA dynamically. But, as long as no HA service got active, it's a simple dummy "wake up every second and do an ioctl keep-alive update". This is really simple and efficiently written, so if that fails for over 10s the systems is really loaded, probably barely responding to anything. Currently the watchdog-mux runs as normal process, no re-nice, no real-time scheduling. This is IMO wrong, as it is a critical process which needs to be run with high priority. I've a patch here which sets it to the highest RR realtime-scheduling priority available, effectively the same what corosync does. diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c index 818ae00..71981d7 100644 --- a/src/watchdog-mux.c +++ b/src/watchdog-mux.c @@ -8,2 +8,3 @@ #include <time.h> +#include <sched.h> #include <sys/ioctl.h> @@ -151,2 +177,15 @@ main(void) + int sched_priority = sched_get_priority_max (SCHED_RR); + if (sched_priority != -1) { + struct sched_param global_sched_param; + global_sched_param.sched_priority = sched_priority; + int res = sched_setscheduler (0, SCHED_RR, &global_sched_param); + if (res == -1) { + fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority); + } else { + fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority); + } + } + + if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) { The issue with no HA but watchdog reset due to massively overloaded system should be avoided already a lot with the scheduling change alone. Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active. This *could* come from a side-effect like ceph rebalacing kicking off and producing a load spike for >10s, hindering the scheduling of the watchdog-mux. This is a theory, but with HA off it needs to be something like that, as in HA-off case there's *no* direct or indirect connection between corosync/pmxcfs and the watchdog-mux. It simply does not cares, or notices, quorum partition changes at all. There may be a approach to reserve the watchdog for the mux, but avoid having it as "ticking time bomb": Theoretically one could open it, then disable it with an ioctl (it can be queried if a driver support that) and only enable it for real once the first client connects to the MUX. This may not work for all watchdog modules, and if, we may want to make it configurable, as some people actually want a reset if a (future) real-time process cannot be scheduled for >= 10 seconds. 
With HA active, well then there could be something off, either in corosync/knet or also in how we interface with it in pmxcfs, that could well be, but won't explain the non-HA issues. Speaking of pmxcfs, that one runs also with standard priority, we may want to change that too to a RT scheduler, so that its ensured it can process all corosync events. I have also a few other small watchdog mux patches around, it should nowadays actually be able to tell us why a reset happened (can also be over/under voltage, temperature, ...) and I'll repeat doing the ioctl for keep-alive a few times if it fails, can only win with that after all. ^ permalink raw reply [flat|nested] 84+ messages in thread
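The diff above only shows the changed hunks. Here is a self-contained sketch of the same pattern for any critical daemon, with memory locking added as well (corosync does both). This is illustrative, not the actual watchdog-mux code, and it needs root or CAP_SYS_NICE / CAP_IPC_LOCK to take effect.

/* Give a critical daemon real-time scheduling and pinned memory so it
 * cannot be starved or paged out under load. Sketch of the general
 * technique described above, not the real watchdog-mux source. */
#include <stdio.h>
#include <sched.h>
#include <sys/mman.h>

static void make_realtime(void)
{
    int prio = sched_get_priority_max(SCHED_RR);
    if (prio == -1) { perror("sched_get_priority_max"); return; }

    struct sched_param sp = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_RR, &sp) == -1)
        perror("sched_setscheduler(SCHED_RR)");
    else
        fprintf(stderr, "running with SCHED_RR priority %d\n", prio);

    /* also pin current and future pages in RAM, as corosync does */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        perror("mlockall");
}

int main(void)
{
    make_realtime();
    /* ... daemon main loop (socket handling, watchdog keep-alive) would follow ... */
    return 0;
}

The same treatment could arguably be applied to pmxcfs, as suggested above, so corosync events keep being processed even when the node is heavily loaded.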
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-09 20:05 ` Thomas Lamprecht @ 2020-09-10 4:58 ` Alexandre DERUMIER 2020-09-10 8:21 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-10 4:58 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar Thanks Thomas for the investigations. I'm still trying to reproduce... I think I have some special case here, because the user of the forum with 30 nodes had corosync cluster split. (Note that I had this bug 6 months ago,when shuting down a node too, and the only way was stop full stop corosync on all nodes, and start corosync again on all nodes). But this time, corosync logs looks fine. (every node, correctly see node2 down, and see remaning nodes) surviving node7, was the only node with HA, and LRM didn't have enable watchog (I don't have found any log like "pve-ha-lrm: watchdog active" for the last 6months on this nodes So, the timing was: 10:39:05 : "halt" command is send to node2 10:39:16 : node2 is leaving corosync / halt -> every node is seeing it and correctly do a new membership with 13 remaining nodes ...don't see any special logs (corosync,pmxcfs,pve-ha-crm,pve-ha-lrm) after the node2 leaving. But they are still activity on the server, pve-firewall is still logging, vms are running fine between 10:40:25 - 10:40:34 : watchdog reset nodes, but not node7. -> so between 70s-80s after the node2 was done, so I think that watchdog-mux was still running fine until that. (That's sound like lrm was stuck, and client_watchdog_timeout have expired in watchdog-mux) 10:40:41 node7, loose quorum (as all others nodes have reset), 10:40:50: node7 crm/lrm finally log. Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied So, I really think that something have stucked lrm/crm loop, and watchdog was not resetted because of that. ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>, "dietmar" <dietmar@proxmox.com> Envoyé: Mercredi 9 Septembre 2020 22:05:49 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 08.09.20 09:11, Alexandre DERUMIER wrote: >>> It would really help if we can reproduce the bug somehow. Do you have and idea how >>> to trigger the bug? > > I really don't known. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting node. > > > Maybe it's related with the number of vms, or the number of nodes, don't have any clue ... I checked a bit the watchdog code, our user-space mux one and the kernel drivers, and just noting a few things here (thinking out aloud): The /dev/watchdog itself is always active, else we could loose it to some other program and not be able to activate HA dynamically. But, as long as no HA service got active, it's a simple dummy "wake up every second and do an ioctl keep-alive update". 
This is really simple and efficiently written, so if that fails for over 10s the systems is really loaded, probably barely responding to anything. Currently the watchdog-mux runs as normal process, no re-nice, no real-time scheduling. This is IMO wrong, as it is a critical process which needs to be run with high priority. I've a patch here which sets it to the highest RR realtime-scheduling priority available, effectively the same what corosync does. diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c index 818ae00..71981d7 100644 --- a/src/watchdog-mux.c +++ b/src/watchdog-mux.c @@ -8,2 +8,3 @@ #include <time.h> +#include <sched.h> #include <sys/ioctl.h> @@ -151,2 +177,15 @@ main(void) + int sched_priority = sched_get_priority_max (SCHED_RR); + if (sched_priority != -1) { + struct sched_param global_sched_param; + global_sched_param.sched_priority = sched_priority; + int res = sched_setscheduler (0, SCHED_RR, &global_sched_param); + if (res == -1) { + fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority); + } else { + fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority); + } + } + + if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) { The issue with no HA but watchdog reset due to massively overloaded system should be avoided already a lot with the scheduling change alone. Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active. This *could* come from a side-effect like ceph rebalacing kicking off and producing a load spike for >10s, hindering the scheduling of the watchdog-mux. This is a theory, but with HA off it needs to be something like that, as in HA-off case there's *no* direct or indirect connection between corosync/pmxcfs and the watchdog-mux. It simply does not cares, or notices, quorum partition changes at all. There may be a approach to reserve the watchdog for the mux, but avoid having it as "ticking time bomb": Theoretically one could open it, then disable it with an ioctl (it can be queried if a driver support that) and only enable it for real once the first client connects to the MUX. This may not work for all watchdog modules, and if, we may want to make it configurable, as some people actually want a reset if a (future) real-time process cannot be scheduled for >= 10 seconds. With HA active, well then there could be something off, either in corosync/knet or also in how we interface with it in pmxcfs, that could well be, but won't explain the non-HA issues. Speaking of pmxcfs, that one runs also with standard priority, we may want to change that too to a RT scheduler, so that its ensured it can process all corosync events. I have also a few other small watchdog mux patches around, it should nowadays actually be able to tell us why a reset happened (can also be over/under voltage, temperature, ...) and I'll repeat doing the ioctl for keep-alive a few times if it fails, can only win with that after all. ^ permalink raw reply [flat|nested] 84+ messages in thread
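To make the suspected failure mode concrete (LRM loop stuck, client timeout in watchdog-mux expiring about a minute later), here is a hedged sketch of the mux-side bookkeeping: one slot per connected client, and the hardware watchdog is only fed while every active client keeps updating. The 60 s constant and the structure are assumptions based on this thread, not the real watchdog-mux.c.

/* Sketch of per-client timeout tracking in a watchdog multiplexer.
 * Simplified for illustration only. */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

#define CLIENT_TIMEOUT 60   /* seconds; "about 1 min" per the thread (assumption) */
#define MAX_CLIENTS    10

struct wd_client {
    bool   active;          /* client asked for fencing protection            */
    time_t last_update;     /* last keep-alive received on the unix socket    */
};

static struct wd_client clients[MAX_CLIENTS];

/* called once per second from the mux main loop (sketch) */
static bool all_clients_alive(time_t now)
{
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (!clients[i].active)
            continue;
        if (now - clients[i].last_update > CLIENT_TIMEOUT)
            return false;   /* e.g. a stuck LRM loop: stop feeding /dev/watchdog */
    }
    return true;            /* keep issuing WDIOC_KEEPALIVE to the driver */
}

int main(void)
{
    time_t now = time(NULL);
    /* simulate an LRM whose loop has been stuck for 87 s, as in the logs above */
    clients[0] = (struct wd_client){ .active = true, .last_update = now - 87 };
    printf("feed watchdog: %s\n", all_clients_alive(now) ? "yes" : "no");  /* prints "no" */
    return 0;
}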
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-10 4:58 ` Alexandre DERUMIER @ 2020-09-10 8:21 ` Thomas Lamprecht 2020-09-10 11:34 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-10 8:21 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER On 10.09.20 06:58, Alexandre DERUMIER wrote: > Thanks Thomas for the investigations. > > I'm still trying to reproduce... > I think I have some special case here, because the user of the forum with 30 nodes had corosync cluster split. (Note that I had this bug 6 months ago,when shuting down a node too, and the only way was stop full stop corosync on all nodes, and start corosync again on all nodes). > > > But this time, corosync logs looks fine. (every node, correctly see node2 down, and see remaning nodes) > > surviving node7, was the only node with HA, and LRM didn't have enable watchog (I don't have found any log like "pve-ha-lrm: watchdog active" for the last 6months on this nodes > > > So, the timing was: > > 10:39:05 : "halt" command is send to node2 > 10:39:16 : node2 is leaving corosync / halt -> every node is seeing it and correctly do a new membership with 13 remaining nodes > > ...don't see any special logs (corosync,pmxcfs,pve-ha-crm,pve-ha-lrm) after the node2 leaving. > But they are still activity on the server, pve-firewall is still logging, vms are running fine > > > between 10:40:25 - 10:40:34 : watchdog reset nodes, but not node7. > > -> so between 70s-80s after the node2 was done, so I think that watchdog-mux was still running fine until that. > (That's sound like lrm was stuck, and client_watchdog_timeout have expired in watchdog-mux) as said, if the other nodes where not using HA, the watchdog-mux had no client which could expire. > > 10:40:41 node7, loose quorum (as all others nodes have reset), > 10:40:50: node7 crm/lrm finally log. > > Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) above lines also indicate very high load. Do you have some monitoring which shows the CPU/IO load before/during this event? > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied > > > > So, I really think that something have stucked lrm/crm loop, and watchdog was not resetted because of that. > ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-10 8:21 ` Thomas Lamprecht @ 2020-09-10 11:34 ` Alexandre DERUMIER 2020-09-10 18:21 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-10 11:34 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion >>as said, if the other nodes where not using HA, the watchdog-mux had no >>client which could expire. sorry, maybe I have wrong explained it, but all my nodes had HA enabled. I have double check lrm_status json files from my morning backup 2h before the problem, they were all in "active" state. ("state":"active","mode":"active" ) I don't why node7 don't have rebooted, the only difference is that is was the crm master. (I think crm also reset the watchdog counter ? maybe behaviour is different than lrm ?) >>above lines also indicate very high load. >>Do you have some monitoring which shows the CPU/IO load before/during this event? load (1,5,15 ) was: 6 (for 48cores), cpu usage: 23% no iowait on disk (vms are on a remote ceph, only proxmox services are running on local ssd disk) so nothing strange here :/ ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Alexandre Derumier" <aderumier@odiso.com> Envoyé: Jeudi 10 Septembre 2020 10:21:48 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 10.09.20 06:58, Alexandre DERUMIER wrote: > Thanks Thomas for the investigations. > > I'm still trying to reproduce... > I think I have some special case here, because the user of the forum with 30 nodes had corosync cluster split. (Note that I had this bug 6 months ago,when shuting down a node too, and the only way was stop full stop corosync on all nodes, and start corosync again on all nodes). > > > But this time, corosync logs looks fine. (every node, correctly see node2 down, and see remaning nodes) > > surviving node7, was the only node with HA, and LRM didn't have enable watchog (I don't have found any log like "pve-ha-lrm: watchdog active" for the last 6months on this nodes > > > So, the timing was: > > 10:39:05 : "halt" command is send to node2 > 10:39:16 : node2 is leaving corosync / halt -> every node is seeing it and correctly do a new membership with 13 remaining nodes > > ...don't see any special logs (corosync,pmxcfs,pve-ha-crm,pve-ha-lrm) after the node2 leaving. > But they are still activity on the server, pve-firewall is still logging, vms are running fine > > > between 10:40:25 - 10:40:34 : watchdog reset nodes, but not node7. > > -> so between 70s-80s after the node2 was done, so I think that watchdog-mux was still running fine until that. > (That's sound like lrm was stuck, and client_watchdog_timeout have expired in watchdog-mux) as said, if the other nodes where not using HA, the watchdog-mux had no client which could expire. > > 10:40:41 node7, loose quorum (as all others nodes have reset), > 10:40:50: node7 crm/lrm finally log. > > Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) above lines also indicate very high load. Do you have some monitoring which shows the CPU/IO load before/during this event? 
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied > > > > So, I really think that something have stucked lrm/crm loop, and watchdog was not resetted because of that. > ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-10 11:34 ` Alexandre DERUMIER @ 2020-09-10 18:21 ` Thomas Lamprecht 2020-09-14 4:54 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-10 18:21 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER On 10.09.20 13:34, Alexandre DERUMIER wrote: >>> as said, if the other nodes where not using HA, the watchdog-mux had no >>> client which could expire. > > sorry, maybe I have wrong explained it, > but all my nodes had HA enabled. > > I have double check lrm_status json files from my morning backup 2h before the problem, > they were all in "active" state. ("state":"active","mode":"active" ) > OK, so all had a connection to the watchdog-mux open. This shifts the suspicion again over to pmxcfs and/or corosync. > I don't why node7 don't have rebooted, the only difference is that is was the crm master. > (I think crm also reset the watchdog counter ? maybe behaviour is different than lrm ?) The watchdog-mux stops updating the real watchdog as soon any client disconnects or times out. It does not know which client (daemon) that was. >>> above lines also indicate very high load. >>> Do you have some monitoring which shows the CPU/IO load before/during this event? > > load (1,5,15 ) was: 6 (for 48cores), cpu usage: 23% > no iowait on disk (vms are on a remote ceph, only proxmox services are running on local ssd disk) > > so nothing strange here :/ Hmm, the long loop times could then be the effect of a pmxcfs read or write operation being (temporarily) stuck. ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-10 18:21 ` Thomas Lamprecht @ 2020-09-14 4:54 ` Alexandre DERUMIER 2020-09-14 7:14 ` Dietmar Maurer 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-14 4:54 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? http://manpages.ubuntu.com/manpages/bionic/man8/sbd.8.html (shared disk heartbeat). Something like a independent daemon (not using corosync/pmxcfs/...), also connected to watchdog muxer. ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com> Envoyé: Jeudi 10 Septembre 2020 20:21:14 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 10.09.20 13:34, Alexandre DERUMIER wrote: >>> as said, if the other nodes where not using HA, the watchdog-mux had no >>> client which could expire. > > sorry, maybe I have wrong explained it, > but all my nodes had HA enabled. > > I have double check lrm_status json files from my morning backup 2h before the problem, > they were all in "active" state. ("state":"active","mode":"active" ) > OK, so all had a connection to the watchdog-mux open. This shifts the suspicion again over to pmxcfs and/or corosync. > I don't why node7 don't have rebooted, the only difference is that is was the crm master. > (I think crm also reset the watchdog counter ? maybe behaviour is different than lrm ?) The watchdog-mux stops updating the real watchdog as soon any client disconnects or times out. It does not know which client (daemon) that was. >>> above lines also indicate very high load. >>> Do you have some monitoring which shows the CPU/IO load before/during this event? > > load (1,5,15 ) was: 6 (for 48cores), cpu usage: 23% > no iowait on disk (vms are on a remote ceph, only proxmox services are running on local ssd disk) > > so nothing strange here :/ Hmm, the long loop times could then be the effect of a pmxcfs read or write operation being (temporarily) stuck. ^ permalink raw reply [flat|nested] 84+ messages in thread
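For illustration, a very simplified sketch of the kind of independent shared-disk heartbeat being proposed above. This is hypothetical, not an existing Proxmox component: the path and interval are made up, and a real implementation such as sbd would use a raw shared block device and the watchdog interface rather than a plain reboot trigger.

# each node periodically stamps its own slot on shared storage;
# if it cannot update its slot it self-fences, independently of corosync/pmxcfs
SLOT=/mnt/shared-heartbeat/$(hostname)      # hypothetical shared-storage path
while sleep 10; do
    if ! date +%s > "$SLOT.tmp" || ! mv "$SLOT.tmp" "$SLOT"; then
        echo b > /proc/sysrq-trigger        # immediate reboot, the self-fencing action
    fi
done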
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-14 4:54 ` Alexandre DERUMIER @ 2020-09-14 7:14 ` Dietmar Maurer 2020-09-14 8:27 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Dietmar Maurer @ 2020-09-14 7:14 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER, Thomas Lamprecht > I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? AFAIK Thomas already has patches to implement active fencing. But IMHO this will not solve the corosync problems.. ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-14 7:14 ` Dietmar Maurer @ 2020-09-14 8:27 ` Alexandre DERUMIER 2020-09-14 8:51 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-14 8:27 UTC (permalink / raw) To: dietmar; +Cc: Proxmox VE development discussion, Thomas Lamprecht

> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ?

>>AFAIK Thomas already has patches to implement active fencing.
>>But IMHO this will not solve the corosync problems..

Yes, sure. I would really like to have 2 different sources of verification, with different paths/software, to avoid this kind of bug.
(shit happens, Murphy's law ;)

As we say in French, "ceinture & bretelles" -> "belt and braces".

BTW, a user has reported a new corosync problem here:
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871
(Sounds like the bug I had 6 months ago, with corosync flooding a lot of UDP packets, but not the same bug I have here)

----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Envoyé: Lundi 14 Septembre 2020 09:14:50
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ?

AFAIK Thomas already has patches to implement active fencing.

But IMHO this will not solve the corosync problems..

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-14 8:27 ` Alexandre DERUMIER @ 2020-09-14 8:51 ` Thomas Lamprecht 2020-09-14 15:45 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-14 8:51 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER, dietmar On 9/14/20 10:27 AM, Alexandre DERUMIER wrote: >> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? > >>> AFAIK Thomas already has patches to implement active fencing. > >>> But IMHO this will not solve the corosync problems.. > > Yes, sure. I'm really to have to 2 differents sources of verification, with different path/software, to avoid this kind of bug. > (shit happens, murphy law ;) would then need at least three, and if one has a bug flooding the network in a lot of setups (not having beefy switches like you ;) the other two will be taken down also, either as memory or the system stack gets overloaded. > > as we say in French "ceinture & bretelles" -> "belt and braces" > > > BTW, > a user have reported new corosync problem here: > https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871 > (Sound like the bug that I have 6month ago, with corosync bug flooding a lof of udp packets, but not the same bug I have here) Did you get in contact with knet/corosync devs about this? Because, it may well be something their stack is better at handling it, maybe there's also really still a bug, or bad behaviour on some edge cases... ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-14 8:51 ` Thomas Lamprecht @ 2020-09-14 15:45 ` Alexandre DERUMIER 2020-09-15 5:45 ` dietmar 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-14 15:45 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar

>>Did you get in contact with knet/corosync devs about this?
>>Because, it may well be something their stack is better at handling it, maybe
>>there's also really still a bug, or bad behaviour on some edge cases...

Not yet; I would like to have more info to submit, because right now I'm blind.
I have enabled debug logs on my whole cluster in case it happens again.

BTW, I have noticed something: corosync is stopped after syslog stops, so at shutdown we never get any corosync logs.

I have edited corosync.service:

- After=network-online.target
+ After=network-online.target syslog.target

and now it's logging correctly.

Now that logging works, I'm also seeing pmxcfs errors when corosync is stopping.
(But no pmxcfs shutdown log.)

Do you think it's possible to have a clean shutdown of pmxcfs first, before stopping corosync ?

"
Sep 14 17:23:49 pve corosync[1346]: [MAIN ] Node was shut down by a signal
Sep 14 17:23:49 pve systemd[1]: Stopping Corosync Cluster Engine...
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Unloading all Corosync service engines.
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_dispatch failed: 2
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync configuration map access
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync configuration service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: node lost quorum
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync profile loading service
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync resource monitoring service
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync watchdog service
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_initialize failed: 2
Sep 14
17:23:49 pve pmxcfs[1132]: [status] crit: can't initialize service Sep 14 17:23:50 pve corosync[1346]: [MAIN ] Corosync Cluster Engine exiting normally " ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com>, "dietmar" <dietmar@proxmox.com> Envoyé: Lundi 14 Septembre 2020 10:51:03 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/14/20 10:27 AM, Alexandre DERUMIER wrote: >> I wonder if something like pacemaker sbd could be implemented in proxmox as extra layer of protection ? > >>> AFAIK Thomas already has patches to implement active fencing. > >>> But IMHO this will not solve the corosync problems.. > > Yes, sure. I'm really to have to 2 differents sources of verification, with different path/software, to avoid this kind of bug. > (shit happens, murphy law ;) would then need at least three, and if one has a bug flooding the network in a lot of setups (not having beefy switches like you ;) the other two will be taken down also, either as memory or the system stack gets overloaded. > > as we say in French "ceinture & bretelles" -> "belt and braces" > > > BTW, > a user have reported new corosync problem here: > https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871 > (Sound like the bug that I have 6month ago, with corosync bug flooding a lof of udp packets, but not the same bug I have here) Did you get in contact with knet/corosync devs about this? Because, it may well be something their stack is better at handling it, maybe there's also really still a bug, or bad behaviour on some edge cases... ^ permalink raw reply [flat|nested] 84+ messages in thread
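For reference, the ordering change described in the message above can also be made without editing the packaged unit file, via a drop-in override; a sketch, relying on the fact that After= entries from drop-ins are additive, so only the extra target needs to be listed:

mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/log-until-shutdown.conf <<'EOF'
[Unit]
# keep corosync ordered after syslog so its shutdown messages still reach the log
After=syslog.target
EOF
systemctl daemon-reload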
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-14 15:45 ` Alexandre DERUMIER @ 2020-09-15 5:45 ` dietmar 2020-09-15 6:27 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-15 5:45 UTC (permalink / raw) To: Alexandre DERUMIER, Thomas Lamprecht; +Cc: Proxmox VE development discussion

> Now, that logging work, I'm also seeeing pmxcfs errors when corosync is stopping.
> (But no pmxcfs shutdown log)
>
> Do you think it's possible to have a clean shutdown of pmxcfs first, before stopping corosync ?

This is by intention - we do not want to stop pmxcfs only because the corosync service stops.

[Unit]
Description=The Proxmox VE cluster filesystem
ConditionFileIsExecutable=/usr/bin/pmxcfs
Wants=corosync.service
Wants=rrdcached.service
Before=corosync.service
Before=ceph.service
Before=cron.service
After=network.target
After=sys-fs-fuse-connections.mount
After=time-sync.target
After=rrdcached.service
DefaultDependencies=no
Before=shutdown.target
Conflicts=shutdown.target

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 5:45 ` dietmar @ 2020-09-15 6:27 ` Alexandre DERUMIER 2020-09-15 7:13 ` dietmar 2020-09-15 7:58 ` Thomas Lamprecht 0 siblings, 2 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 6:27 UTC (permalink / raw) To: dietmar; +Cc: Thomas Lamprecht, Proxmox VE development discussion

>>This is by intention - we do not want to stop pmxcfs only because coorosync service stops.

Yes, but at shutdown, wouldn't it be good to stop pmxcfs before corosync?

I ask because the 2 times I had the problem, it was when shutting down a server.
So maybe some strange behaviour occurs when both corosync && pmxcfs are stopped at the same time?

Looking at the pve-cluster unit file, why do we have "Before=corosync.service" and not "After=corosync.service"?
I have tried to change this, but even with that, both are still shutting down in parallel.

The only way I have found to get a clean shutdown is "Requires=corosync.service" + "After=corosync.service". But that means that if you restart corosync, it restarts pmxcfs first too.

I have looked at the systemd docs; After= should be enough (as shutdown is done in the reverse order), but I don't know why corosync doesn't wait for pve-cluster ???

(Also, I think that pmxcfs is also stopping after syslog, because I never see the pmxcfs "teardown filesystem" logs at shutdown)

----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 07:45:41
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> Now, that logging work, I'm also seeeing pmxcfs errors when corosync is stopping.
> (But no pmxcfs shutdown log)
>
> Do you think it's possible to have a clean shutdown of pmxcfs first, before stopping corosync ?

This is by intention - we do not want to stop pmxcfs only because coorosync service stops.

Unit]
Description=The Proxmox VE cluster filesystem
ConditionFileIsExecutable=/usr/bin/pmxcfs
Wants=corosync.service
Wants=rrdcached.service
Before=corosync.service
Before=ceph.service
Before=cron.service
After=network.target
After=sys-fs-fuse-connections.mount
After=time-sync.target
After=rrdcached.service
DefaultDependencies=no
Before=shutdown.target
Conflicts=shutdown.target

^ permalink raw reply [flat|nested] 84+ messages in thread
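The experiment described in the message above would look roughly like this as a drop-in; a sketch only. Note that ordering directives are additive, so the stock Before=corosync.service from the packaged unit still applies alongside the added After=, which may be why the stop order did not change, while Requires= is what makes pve-cluster follow an explicit stop or restart of corosync.

mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/ordering-test.conf <<'EOF'
[Unit]
# hypothetical test drop-in: try to make pmxcfs stop before corosync at shutdown
After=corosync.service
Requires=corosync.service
EOF
systemctl daemon-reload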
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 6:27 ` Alexandre DERUMIER @ 2020-09-15 7:13 ` dietmar 2020-09-15 8:42 ` Alexandre DERUMIER 2020-09-15 7:58 ` Thomas Lamprecht 1 sibling, 1 reply; 84+ messages in thread From: dietmar @ 2020-09-15 7:13 UTC (permalink / raw) To: Alexandre DERUMIER; +Cc: Thomas Lamprecht, Proxmox VE development discussion > I ask the question, because the 2 times I have problem, it was when shutting down a server. > So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ? pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes. ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 7:13 ` dietmar @ 2020-09-15 8:42 ` Alexandre DERUMIER 2020-09-15 9:35 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 8:42 UTC (permalink / raw) To: dietmar; +Cc: Thomas Lamprecht, Proxmox VE development discussion

>>pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.

Yes, I understand that, but I was thinking of the case where corosync is in its stopping phase (not totally stopped). Something racy (I really don't know).

I just sent 2 patches to start pve-cluster && corosync after syslog; like this we'll have shutdown logs too.

(I'm currently trying to reproduce the problem with reboot loops, but I still can't reproduce it :/ )

----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 09:13:53
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> I ask the question, because the 2 times I have problem, it was when shutting down a server.
> So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ?

pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 8:42 ` Alexandre DERUMIER @ 2020-09-15 9:35 ` Alexandre DERUMIER 2020-09-15 9:46 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 9:35 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: dietmar, Thomas Lamprecht

Hi,

I have finally reproduced it !

But this is with a corosync restart from cron every 1 minute, on node1.

Then: the lrm was stuck for too long, for around 60s, and the softdog was triggered on multiple other nodes.

Here are the logs with full corosync debug at the time of the last corosync restart.

node1 (where corosync is restarted each minute)
https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e

node2
https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67

node5
https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273

I'll prepare logs from the previous corosync restart, as the lrm seems to have been stuck already before that.

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "dietmar" <dietmar@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Envoyé: Mardi 15 Septembre 2020 10:42:15
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.

yes, I understand that, but I was thinking of the case if corosync is in stopping phase (not totally stopped). Something racy (I really don't known).

I just send 2 patch to start pve-cluster && corosync after syslog, like this we'll have shutdown logs too.

(I'm currently try to reproduce the problem, with reboot loops, but I still can't reproduce it :/ )

----- Mail original -----
De: "dietmar" <dietmar@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 09:13:53
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> I ask the question, because the 2 times I have problem, it was when shutting down a server.
> So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ?

pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes.

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply [flat|nested] 84+ messages in thread
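The reproducer described above boils down to restarting corosync periodically on a single node; a sketch, where the cron.d file name is just one way to set it up:

# on node1 only: restart corosync once per minute
echo '* * * * * root systemctl restart corosync' > /etc/cron.d/corosync-restart-test
# or, interactively, from a shell on node1:
while sleep 60; do systemctl restart corosync; done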
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 9:35 ` Alexandre DERUMIER @ 2020-09-15 9:46 ` Thomas Lamprecht 2020-09-15 10:15 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-15 9:46 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: > Hi, > > I have finally reproduce it ! > > But this is with a corosync restart in cron each 1 minute, on node1 > > Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes. > > here the logs with full corosync debug at the time of last corosync restart. > > node1 (where corosync is restarted each minute) > https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e > > node2 > https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 > > node5 > https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 > > I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before. Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21 > Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 9:46 ` Thomas Lamprecht @ 2020-09-15 10:15 ` Alexandre DERUMIER 2020-09-15 11:04 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 10:15 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion, dietmar here the previous restart log node1 -> corosync restart at 10:46:15 ----- https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7 node2 ----- https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba node5 ------ https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37 ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "dietmar" <dietmar@proxmox.com> Envoyé: Mardi 15 Septembre 2020 11:46:51 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: > Hi, > > I have finally reproduce it ! > > But this is with a corosync restart in cron each 1 minute, on node1 > > Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes. > > here the logs with full corosync debug at the time of last corosync restart. > > node1 (where corosync is restarted each minute) > https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e > > node2 > https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 > > node5 > https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 > > I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before. Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21 > Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 10:15 ` Alexandre DERUMIER @ 2020-09-15 11:04 ` Alexandre DERUMIER 2020-09-15 12:49 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 11:04 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht also logs of node14, where the lrm was not too long https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 12:15:47 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown here the previous restart log node1 -> corosync restart at 10:46:15 ----- https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7 node2 ----- https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba node5 ------ https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37 ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "dietmar" <dietmar@proxmox.com> Envoyé: Mardi 15 Septembre 2020 11:46:51 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: > Hi, > > I have finally reproduce it ! > > But this is with a corosync restart in cron each 1 minute, on node1 > > Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes. > > here the logs with full corosync debug at the time of last corosync restart. > > node1 (where corosync is restarted each minute) > https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e > > node2 > https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 > > node5 > https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 > > I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before. Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21 > Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 11:04 ` Alexandre DERUMIER @ 2020-09-15 12:49 ` Alexandre DERUMIER 2020-09-15 13:00 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 12:49 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht Hi, I have produce it again, now I can't write to /etc/pve/ from any node I have also added some debug logs to pve-ha-lrm, and it was stuck in: (but if /etc/pve is locked, this is normal) if ($fence_request) { $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); $self->set_local_status({ state => 'lost_agent_lock'}); } elsif (!$self->get_protected_ha_agent_lock()) { $self->set_local_status({ state => 'lost_agent_lock'}); } elsif ($self->{mode} eq 'maintenance') { $self->set_local_status({ state => 'maintenance'}); } corosync quorum is currently ok I'm currently digging the logs ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mardi 15 Septembre 2020 13:04:31 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown also logs of node14, where the lrm was not too long https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 12:15:47 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown here the previous restart log node1 -> corosync restart at 10:46:15 ----- https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7 node2 ----- https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba node5 ------ https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37 ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "dietmar" <dietmar@proxmox.com> Envoyé: Mardi 15 Septembre 2020 11:46:51 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: > Hi, > > I have finally reproduce it ! > > But this is with a corosync restart in cron each 1 minute, on node1 > > Then: lrm was stuck for too long for around 60s and softdog have been triggered on multiple other nodes. > > here the logs with full corosync debug at the time of last corosync restart. > > node1 (where corosync is restarted each minute) > https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e > > node2 > https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 > > node5 > https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 > > I'll prepare logs from the previous corosync restart, as the lrm seem to be already stuck before. 
Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21 > Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
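A few commands that may help triage the state described above (corosync reports quorum, but /etc/pve refuses writes); a sketch, assuming the standard corosync tools are installed and that a plain file may be created under /etc/pve for the write test:

corosync-quorumtool -s     # quorum and membership as corosync sees it
corosync-cfgtool -s        # per-node knet link status
# does pmxcfs still accept writes?
touch /etc/pve/writetest && rm /etc/pve/writetest && echo "pmxcfs writable"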
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 12:49 ` Alexandre DERUMIER @ 2020-09-15 13:00 ` Thomas Lamprecht 2020-09-15 14:09 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-15 13:00 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion On 9/15/20 2:49 PM, Alexandre DERUMIER wrote: > Hi, > > I have produce it again, > > now I can't write to /etc/pve/ from any node > OK, so seems to really be an issue in pmxcfs or between corosync and pmxcfs, not the HA LRM or watchdog mux itself. Can you try to give pmxcfs real time scheduling, e.g., by doing: # systemctl edit pve-cluster And then add snippet: [Service] CPUSchedulingPolicy=rr CPUSchedulingPriority=99 And restart pve-cluster > I have also added some debug logs to pve-ha-lrm, and it was stuck in: > (but if /etc/pve is locked, this is normal) > > if ($fence_request) { > $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif (!$self->get_protected_ha_agent_lock()) { > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif ($self->{mode} eq 'maintenance') { > $self->set_local_status({ state => 'maintenance'}); > } > > > corosync quorum is currently ok > > I'm currently digging the logs Is your most simplest/stable reproducer still a periodic restart of corosync in one node? ^ permalink raw reply [flat|nested] 84+ messages in thread
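A non-interactive way to apply the override suggested above and to verify that it took effect; a sketch, with chrt coming from util-linux:

mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/override.conf <<'EOF'
[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99
EOF
systemctl daemon-reload
systemctl restart pve-cluster
chrt -p "$(pidof pmxcfs)"   # should report SCHED_RR with priority 99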
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 13:00 ` Thomas Lamprecht @ 2020-09-15 14:09 ` Alexandre DERUMIER 2020-09-15 14:19 ` Alexandre DERUMIER 2020-09-15 14:32 ` Thomas Lamprecht 0 siblings, 2 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 14:09 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

>>
>>Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>
>># systemctl edit pve-cluster
>>
>>And then add snippet:
>>
>>
>>[Service]
>>CPUSchedulingPolicy=rr
>>CPUSchedulingPriority=99

Yes, sure, I'll do it now.

> I'm currently digging the logs

>>Is your most simplest/stable reproducer still a periodic restart of corosync in one node?

Yes, a simple "systemctl restart corosync" on 1 node every minute.

After 1 hour, it's still locked.

On the other nodes, I still have pmxcfs logs like:

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log
...

On node1, I just restarted the pve-cluster service with "systemctl restart pve-cluster"; the pmxcfs process was killed, but it was not possible to start it again, and after that /etc/pve became writable again on the other nodes.

(I have not rebooted node1 yet, in case you want more tests on pmxcfs.)

root@m6kvm1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s ago
Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
manual "pmxcfs -d" https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000064) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000063) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 157 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 4 inode updates) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 31 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: 
[confdb] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000065) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000064) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 20 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 9 inode updates) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 25 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:40:27 m6kvm1 
pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000066) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000065) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 87 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 6 inode updates) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 33 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 
9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000067) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000066) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log ----> restart 2352 [ 15/09/2020 15:52:00 ] systemctl restart pve-cluster Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration filesystem (-1) some interesting dmesg about "pvesr" [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 [Tue Sep 15 14:45:34 2020] Call Trace: [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 [Tue Sep 15 14:45:34 2020] ? 
filename_parentat.isra.57.part.58+0xf7/0x180 [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 15:00:03 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 2:49 PM, Alexandre DERUMIER wrote: > Hi, > > I have produce it again, > > now I can't write to /etc/pve/ from any node > OK, so seems to really be an issue in pmxcfs or between corosync and pmxcfs, not the HA LRM or watchdog mux itself. Can you try to give pmxcfs real time scheduling, e.g., by doing: # systemctl edit pve-cluster And then add snippet: [Service] CPUSchedulingPolicy=rr CPUSchedulingPriority=99 And restart pve-cluster > I have also added some debug logs to pve-ha-lrm, and it was stuck in: > (but if /etc/pve is locked, this is normal) > > if ($fence_request) { > $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif (!$self->get_protected_ha_agent_lock()) { > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif ($self->{mode} eq 'maintenance') { > $self->set_local_status({ state => 'maintenance'}); > } > > > corosync quorum is currently ok > > I'm currently digging the logs Is your most simplest/stable reproducer still a periodic restart of corosync in one node? ^ permalink raw reply [flat|nested] 84+ messages in thread
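When pmxcfs dies without unmounting /etc/pve, the "fuse_mount error: Transport endpoint is not connected" failures shown above may be cleared by detaching the stale mount point before starting the service again; a sketch:

# lazily detach the dead FUSE mount, then let pmxcfs mount /etc/pve again
umount -l /etc/pve || fusermount -uz /etc/pve
systemctl start pve-cluster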
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 14:09 ` Alexandre DERUMIER @ 2020-09-15 14:19 ` Alexandre DERUMIER 0 siblings, 0 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 14:19 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion

About node1: the /etc/pve directory seems to be in a bad state, that's why it can't be mounted:

ls -lah /etc/pve:
?????????? ? ? ? ? ? pve

I have forced a lazy umount

umount -l /etc/pve

and now it's working fine.

(So maybe when pmxcfs was killed, it didn't cleanly unmount /etc/pve?)

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 16:09:50
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>
>>Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>
>># systemctl edit pve-cluster
>>
>>And then add snippet:
>>
>>
>>[Service]
>>CPUSchedulingPolicy=rr
>>CPUSchedulingPriority=99

yes, sure, I'll do it now

> I'm currently digging the logs

>>Is your most simplest/stable reproducer still a periodic restart of corosync in one node?

yes, a simple "systemctl restart corosync" on 1 node each minute

After 1hour, it's still locked.

on other nodes, I still have pmxfs logs like:

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log
...

on node1, I just restarted the pve-cluster service with systemctl restart pve-cluster, the pmxcfs process was killed, but not able to start it again and after that the /etc/pve become writable again on others node.

(I don't have rebooted yet node1, if you want more test on pmxcfs)

root@m6kvm1:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s ago
Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
manual "pmxcfs -d" https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000064) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000063) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 157 Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 4 inode updates) Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 31 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: 
[confdb] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000065) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000064) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 20 Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 9 inode updates) Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 25 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:40:27 m6kvm1 
pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000066) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000065) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from leader Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue length 23 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue length 87 Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to commit (got 6 inode updates) Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 33 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 
9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data syncronisation Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 12/3795, 13/3776, 14/3778 Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data syncronisation Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request (epoch 1/3491/00000067) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request (epoch 1/3491/00000066) Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log ----> restart 2352 [ 15/09/2020 15:52:00 ] systemctl restart pve-cluster Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration filesystem (-1) Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport endpoint is not connected Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration filesystem (-1) some interesting dmesg about "pvesr" [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 [Tue Sep 15 14:45:34 2020] Call Trace: [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 [Tue Sep 15 14:45:34 2020] ? 
filename_parentat.isra.57.part.58+0xf7/0x180 [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 15:00:03 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 2:49 PM, Alexandre DERUMIER wrote: > Hi, > > I have produce it again, > > now I can't write to /etc/pve/ from any node > OK, so seems to really be an issue in pmxcfs or between corosync and pmxcfs, not the HA LRM or watchdog mux itself. Can you try to give pmxcfs real time scheduling, e.g., by doing: # systemctl edit pve-cluster And then add snippet: [Service] CPUSchedulingPolicy=rr CPUSchedulingPriority=99 And restart pve-cluster > I have also added some debug logs to pve-ha-lrm, and it was stuck in: > (but if /etc/pve is locked, this is normal) > > if ($fence_request) { > $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif (!$self->get_protected_ha_agent_lock()) { > $self->set_local_status({ state => 'lost_agent_lock'}); > } elsif ($self->{mode} eq 'maintenance') { > $self->set_local_status({ state => 'maintenance'}); > } > > > corosync quorum is currently ok > > I'm currently digging the logs Is your most simplest/stable reproducer still a periodic restart of corosync in one node? ^ permalink raw reply [flat|nested] 84+ messages in thread
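
For reference, the scheduling change suggested above boils down to a standard systemd drop-in. The override path below is simply where "systemctl edit" places its snippet; the chrt/pidof verification step is an assumption added for illustration, not a command given in the thread:

# "systemctl edit pve-cluster" writes the snippet to:
#   /etc/systemd/system/pve-cluster.service.d/override.conf
[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99

# apply it and confirm pmxcfs now runs with the real-time policy
systemctl restart pve-cluster
chrt -p $(pidof pmxcfs)    # should report SCHED_RR, priority 99
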
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 14:09 ` Alexandre DERUMIER 2020-09-15 14:19 ` Alexandre DERUMIER @ 2020-09-15 14:32 ` Thomas Lamprecht 2020-09-15 14:57 ` Alexandre DERUMIER 1 sibling, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-15 14:32 UTC (permalink / raw) To: Alexandre DERUMIER; +Cc: Proxmox VE development discussion On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: >>> Can you try to give pmxcfs real time scheduling, e.g., by doing: >>> >>> # systemctl edit pve-cluster >>> >>> And then add snippet: >>> >>> >>> [Service] >>> CPUSchedulingPolicy=rr >>> CPUSchedulingPriority=99 > yes, sure, I'll do it now > > >> I'm currently digging the logs >>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? > yes, a simple "systemctl restart corosync" on 1 node each minute > > > > After 1hour, it's still locked. > > on other nodes, I still have pmxfs logs like: > I mean this is bad, but also great! Can you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot. > manual "pmxcfs -d" > https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e > Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug. > > some interesting dmesg about "pvesr" > > [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. > [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 > [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 > [Tue Sep 15 14:45:34 2020] Call Trace: > [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 > [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 > [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 > [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 > [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 > [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 > [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 > [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 > [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 > [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > hmm, hangs in mkdir (cluster wide locking) ^ permalink raw reply [flat|nested] 84+ messages in thread
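
Since a coredump of the running pmxcfs is requested here, a minimal way to grab one without stopping the process is gdb's gcore; the output paths and the pveversion call below are assumptions for illustration, not commands from the thread:

# dump the live pmxcfs process (gcore ships with the gdb package)
gcore -o /tmp/pmxcfs.core $(pidof pmxcfs)

# record the exact package versions alongside the dump, so matching dbgsym packages can be found later
pveversion -v > /tmp/pmxcfs.versions.txt
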
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 14:32 ` Thomas Lamprecht @ 2020-09-15 14:57 ` Alexandre DERUMIER 2020-09-15 15:58 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 14:57 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion >>I mean this is bad, but also great! >>Cam you do a coredump of the whole thing and upload it somewhere with the version info >>used (for dbgsym package)? That could help a lot. I'll try to reproduce it again (with the full lock everywhere), and do the coredump. I have tried the real time scheduling, but I still have been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think it's unlocking something at next corosync restart.) this time it was blocked at the same time on a node in: work { ... } elsif ($state eq 'active') { .... $self->update_lrm_status(); and another node in if ($fence_request) { $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); $self->set_local_status({ state => 'lost_agent_lock'}); } elsif (!$self->get_protected_ha_agent_lock()) { $self->set_local_status({ state => 'lost_agent_lock'}); } elsif ($self->{mode} eq 'maintenance') { $self->set_local_status({ state => 'maintenance'}); } ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:32:52 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: >>> Can you try to give pmxcfs real time scheduling, e.g., by doing: >>> >>> # systemctl edit pve-cluster >>> >>> And then add snippet: >>> >>> >>> [Service] >>> CPUSchedulingPolicy=rr >>> CPUSchedulingPriority=99 > yes, sure, I'll do it now > > >> I'm currently digging the logs >>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? > yes, a simple "systemctl restart corosync" on 1 node each minute > > > > After 1hour, it's still locked. > > on other nodes, I still have pmxfs logs like: > I mean this is bad, but also great! Cam you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot. > manual "pmxcfs -d" > https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e > Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug. > > some interesting dmesg about "pvesr" > > [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. > [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 > [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 > [Tue Sep 15 14:45:34 2020] Call Trace: > [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 > [Tue Sep 15 14:45:34 2020] ? 
filename_parentat.isra.57.part.58+0xf7/0x180 > [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 > [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 > [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 > [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 > [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 > [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 > [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 > [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > hmm, hangs in mkdir (cluster wide locking) ^ permalink raw reply [flat|nested] 84+ messages in thread
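
The mkdir that pvesr hangs in goes through /etc/pve, i.e. through pmxcfs. A time-boxed probe like the following (file names are made up for illustration, mirroring the "echo test > /etc/pve/test" loop used later in the thread) shows whether the FUSE mount is still answering writes, without hanging the shell the way pvesr did:

# bounded write and mkdir probes against the pmxcfs mount
timeout 5 sh -c 'echo probe > /etc/pve/writetest' && echo "write ok" || echo "write hung"
timeout 5 mkdir /etc/pve/locktest && rmdir /etc/pve/locktest || echo "mkdir hung"

If both time out, the cluster filesystem is wedged in the same way as the pvesr task in the trace above.
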
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 14:57 ` Alexandre DERUMIER @ 2020-09-15 15:58 ` Alexandre DERUMIER 2020-09-16 7:34 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-15 15:58 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht Another small lock at 17:41:09 To be sure, I have done a small loop of write each second in /etc/pve, node node2. it's hanging at first corosync restart, then, on second corosync restart it's working again. I'll try to improve this tomorrow to be able to debug corosync process - restarting corosync do some write in /etc/pve/ - and if it's hanging don't restart corosync again node2: echo test > /etc/pve/test loop -------------------------------------- Current time : 17:41:01 Current time : 17:41:02 Current time : 17:41:03 Current time : 17:41:04 Current time : 17:41:05 Current time : 17:41:06 Current time : 17:41:07 Current time : 17:41:08 Current time : 17:41:09 hang Current time : 17:42:05 Current time : 17:42:06 Current time : 17:42:07 node1 ----- Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster 
name m6kvm, version = 20) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) ep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) ep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] 
group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 
corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 .... .... next corosync restart Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() node2 ----- Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: 
[VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA 
Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 
15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:57:46 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>I mean this is bad, but also great! >>Cam you do a coredump of the whole thing and upload it somewhere with the version info >>used (for dbgsym package)? That could help a lot. I'll try to reproduce it again (with the full lock everywhere), and do the coredump. I have tried the real time scheduling, but I still have been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think it's unlocking something at next corosync restart.) this time it was blocked at the same time on a node in: work { ... } elsif ($state eq 'active') { .... 
$self->update_lrm_status(); and another node in if ($fence_request) { $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); $self->set_local_status({ state => 'lost_agent_lock'}); } elsif (!$self->get_protected_ha_agent_lock()) { $self->set_local_status({ state => 'lost_agent_lock'}); } elsif ($self->{mode} eq 'maintenance') { $self->set_local_status({ state => 'maintenance'}); } ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:32:52 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: >>> Can you try to give pmxcfs real time scheduling, e.g., by doing: >>> >>> # systemctl edit pve-cluster >>> >>> And then add snippet: >>> >>> >>> [Service] >>> CPUSchedulingPolicy=rr >>> CPUSchedulingPriority=99 > yes, sure, I'll do it now > > >> I'm currently digging the logs >>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? > yes, a simple "systemctl restart corosync" on 1 node each minute > > > > After 1hour, it's still locked. > > on other nodes, I still have pmxfs logs like: > I mean this is bad, but also great! Cam you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot. > manual "pmxcfs -d" > https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e > Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug. > > some interesting dmesg about "pvesr" > > [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. > [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 > [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 > [Tue Sep 15 14:45:34 2020] Call Trace: > [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 > [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 > [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 > [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 > [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 > [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 > [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 > [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 > [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 > [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > hmm, hangs in mkdir (cluster wide locking) _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
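
For reference, the reproducer and the coredump request discussed above can be combined into one small helper: restart corosync on one node once per minute, stop restarting as soon as a write into /etc/pve hangs (since, as noted above, the next corosync restart tends to unlock pmxcfs again and destroys the evidence), and only then capture the state of the stuck pmxcfs. This is only a sketch under assumptions, not something from the thread: the /etc/pve/test scratch file and the 10-second timeout are arbitrary choices.

#!/bin/bash
# Sketch only (assumed: /etc/pve/test scratch file, 10s write timeout).
while true; do
    systemctl restart corosync
    sleep 60
    # Probe pmxcfs with a small write; stop the loop if it does not complete,
    # so the hung state is preserved for debugging. If the write blocks in
    # uninterruptible sleep, 'timeout' itself may hang here, which still
    # prevents any further corosync restarts.
    if ! timeout 10 sh -c 'echo test > /etc/pve/test'; then
        echo "write into /etc/pve hung at $(date) -- leaving pmxcfs untouched"
        break
    fi
done

Once pmxcfs is stuck, a core dump plus full backtraces can then be captured for the dbgsym analysis, e.g. with gcore -o /root/pmxcfs.core "$(pidof pmxcfs)", or interactively:

gdb -p "$(pidof pmxcfs)"
(gdb) set pagination off
(gdb) thread apply all bt full
(gdb) generate-core-file /root/pmxcfs.core
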
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 15:58 ` Alexandre DERUMIER @ 2020-09-16 7:34 ` Alexandre DERUMIER 2020-09-16 7:58 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 7:34 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht Hi, I have produced the problem now, and this time I don't have restarted corosync a second after the lock of /etc/pve so, currently it's readonly. I don't have used gdb since a long time, could you tell me how to attach to the running pmxcfs process, and some gbd commands to launch ? ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mardi 15 Septembre 2020 17:58:33 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Another small lock at 17:41:09 To be sure, I have done a small loop of write each second in /etc/pve, node node2. it's hanging at first corosync restart, then, on second corosync restart it's working again. I'll try to improve this tomorrow to be able to debug corosync process - restarting corosync do some write in /etc/pve/ - and if it's hanging don't restart corosync again node2: echo test > /etc/pve/test loop -------------------------------------- Current time : 17:41:01 Current time : 17:41:02 Current time : 17:41:03 Current time : 17:41:04 Current time : 17:41:05 Current time : 17:41:06 Current time : 17:41:07 Current time : 17:41:08 Current time : 17:41:09 hang Current time : 17:42:05 Current time : 17:42:06 Current time : 17:42:07 node1 ----- Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) Sep 15 17:41:09 m6kvm1 
corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) ep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) ep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] 
group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 
corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 .... .... next corosync restart Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() node2 ----- Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: 
[VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA 
Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 
15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 16:57:46
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>I mean this is bad, but also great!
>>Can you do a coredump of the whole thing and upload it somewhere with the version info
>>used (for dbgsym package)? That could help a lot.

I'll try to reproduce it again (with the full lock everywhere) and take the coredump.

I have tried the real-time scheduling, but I was still able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think the next corosync restart unlocks something).

This time it was blocked at the same time on one node in:

work {
...
} elsif ($state eq 'active') {
....
$self->update_lrm_status();

and on another node in:

if ($fence_request) {
    $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
    $self->set_local_status({ state => 'lost_agent_lock'});
} elsif (!$self->get_protected_ha_agent_lock()) {
    $self->set_local_status({ state => 'lost_agent_lock'});
} elsif ($self->{mode} eq 'maintenance') {
    $self->set_local_status({ state => 'maintenance'});
}

----- Mail original -----
De: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 15 Septembre 2020 16:32:52
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 4:09 PM, Alexandre DERUMIER wrote:
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>>
>>> # systemctl edit pve-cluster
>>>
>>> And then add snippet:
>>>
>>>
>>> [Service]
>>> CPUSchedulingPolicy=rr
>>> CPUSchedulingPriority=99
> yes, sure, I'll do it now
>
>
>> I'm currently digging the logs
>>> Is your simplest/most stable reproducer still a periodic restart of corosync on one node?
> yes, a simple "systemctl restart corosync" on 1 node each minute
>
>
>
> After 1 hour, it's still locked.
>
> on other nodes, I still have pmxcfs logs like:
>

I mean this is bad, but also great! Can you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot.

> manual "pmxcfs -d"
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e
>

Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug.

>
> some interesting dmesg about "pvesr"
>
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080
> [Tue Sep 15 14:45:34 2020] Call Trace:
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>

hmm, hangs in mkdir (cluster wide locking)

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply [flat|nested] 84+ messages in thread
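The reproducer discussed above (restart corosync on one node every minute until /etc/pve locks up) can be made easier to debug by probing pmxcfs after each restart and stopping as soon as a write hangs, so the locked state is preserved for inspection. A minimal sketch, assuming a throwaway test file under /etc/pve and an arbitrary probe timeout (neither is from the thread):

#!/bin/bash
# Restart corosync once a minute; stop restarting as soon as /etc/pve stops
# accepting writes, so the hung state stays around for debugging.
while true; do
    systemctl restart corosync
    sleep 60
    # bounded write probe against pmxcfs; the 10s timeout is illustrative
    if ! timeout 10 bash -c 'echo test > /etc/pve/test'; then
        echo "$(date): write to /etc/pve hung or failed - not restarting corosync again"
        break
    fi
done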
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 7:34 ` Alexandre DERUMIER @ 2020-09-16 7:58 ` Alexandre DERUMIER 2020-09-16 8:30 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 7:58 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht here a backtrace with pve-cluster-dbgsym installed (gdb) bt full #0 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #2 0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055 ret = -1 lockfd = <optimized out> pipefd = {8, 9} foreground = 0 force_local_mode = 0 wrote_pidfile = 1 memdb = 0x5617f7c563b0 dcdb = 0x5617f8046ca0 status_fsm = 0x5617f806a630 context = <optimized out> entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, { long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, { long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}} err = 0x0 __func__ = "main" utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>} dot = <optimized out> www_data = <optimized out> create = <optimized out> conf_data = 0x5617f80466a0 len = <optimized out> config = <optimized out> bplug = <optimized out> fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0} fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1} fuse_chan = 0x5617f7c560c0 corosync_loop = 0x5617f80481d0 service_quorum = 0x5617f8048460 service_confdb = 0x5617f8069da0 service_dcdb = 0x5617f806a5d0 service_status = 0x5617f806a8e0 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mercredi 16 Septembre 2020 09:34:27 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Hi, I have produced the problem now, and this time I don't have restarted corosync a second after the lock of /etc/pve so, currently it's readonly. I don't have used gdb since a long time, could you tell me how to attach to the running pmxcfs process, and some gbd commands to launch ? 
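For the question above (how to attach to the already-running pmxcfs), one possible gdb session could look like the following; it assumes gdb and pve-cluster-dbgsym are installed, and the core file path is only an example:

# attach to the running process (pmxcfs is paused while gdb is attached)
gdb -p $(pidof pmxcfs)

(gdb) set pagination off
(gdb) thread apply all bt full              # full backtrace of every thread, with locals
(gdb) generate-core-file /tmp/pmxcfs.core   # keep a core for later offline analysis
(gdb) detach
(gdb) quit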
----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mardi 15 Septembre 2020 17:58:33 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Another small lock at 17:41:09 To be sure, I have done a small loop of write each second in /etc/pve, node node2. it's hanging at first corosync restart, then, on second corosync restart it's working again. I'll try to improve this tomorrow to be able to debug corosync process - restarting corosync do some write in /etc/pve/ - and if it's hanging don't restart corosync again node2: echo test > /etc/pve/test loop -------------------------------------- Current time : 17:41:01 Current time : 17:41:02 Current time : 17:41:03 Current time : 17:41:04 Current time : 17:41:05 Current time : 17:41:06 Current time : 17:41:07 Current time : 17:41:08 Current time : 17:41:09 hang Current time : 17:42:05 Current time : 17:42:06 Current time : 17:42:07 node1 ----- Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 17:41:09 m6kvm1 
corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) ep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.110) ep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] 
group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 
corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 .... .... next corosync restart Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() node2 ----- Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: 
[VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA 
Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 
15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:57:46 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>I mean this is bad, but also great! >>Cam you do a coredump of the whole thing and upload it somewhere with the version info >>used (for dbgsym package)? That could help a lot. I'll try to reproduce it again (with the full lock everywhere), and do the coredump. I have tried the real time scheduling, but I still have been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think it's unlocking something at next corosync restart.) this time it was blocked at the same time on a node in: work { ... } elsif ($state eq 'active') { .... 
$self->update_lrm_status(); and another node in if ($fence_request) { $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); $self->set_local_status({ state => 'lost_agent_lock'}); } elsif (!$self->get_protected_ha_agent_lock()) { $self->set_local_status({ state => 'lost_agent_lock'}); } elsif ($self->{mode} eq 'maintenance') { $self->set_local_status({ state => 'maintenance'}); } ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:32:52 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: >>> Can you try to give pmxcfs real time scheduling, e.g., by doing: >>> >>> # systemctl edit pve-cluster >>> >>> And then add snippet: >>> >>> >>> [Service] >>> CPUSchedulingPolicy=rr >>> CPUSchedulingPriority=99 > yes, sure, I'll do it now > > >> I'm currently digging the logs >>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? > yes, a simple "systemctl restart corosync" on 1 node each minute > > > > After 1hour, it's still locked. > > on other nodes, I still have pmxfs logs like: > I mean this is bad, but also great! Cam you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot. > manual "pmxcfs -d" > https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e > Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug. > > some interesting dmesg about "pvesr" > > [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. > [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 > [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 > [Tue Sep 15 14:45:34 2020] Call Trace: > [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 > [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 > [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 > [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 > [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 > [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 > [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 > [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 > [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 > [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > hmm, hangs in mkdir (cluster wide locking) _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
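A minimal sketch of how that coredump could be captured without killing the hung pmxcfs, assuming gdb (which ships gcore) and the pve-cluster-dbgsym package are installable on the node; the exact package versions are saved alongside so the dump can later be matched to the right debug symbols:

    # install debug symbols and gdb (gcore comes with gdb)
    apt install gdb pve-cluster-dbgsym
    # dump a core of the running pmxcfs without stopping it
    gcore -o /var/tmp/pmxcfs.core "$(pidof pmxcfs)"
    # record the exact versions so the core can be matched to the dbgsym packages
    pveversion -v > /var/tmp/pmxcfs.core.versions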
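The minute-by-minute reproducer mentioned above ("systemctl restart corosync" on one node each minute) can also be made to stop itself as soon as the cluster filesystem hangs, so the broken state is preserved for debugging instead of being unblocked by the next corosync restart. A rough sketch, assuming it runs on the node whose corosync gets restarted and using a throwaway file in /etc/pve as write probe:

    #!/bin/bash
    # sketch: restart corosync once a minute, but stop as soon as a write into
    # /etc/pve no longer completes, so the hung state stays around for debugging
    while true; do
        if ! timeout 10 sh -c 'echo test > /etc/pve/test'; then
            echo "$(date) write into /etc/pve did not finish within 10s, not restarting corosync"
            break
        fi
        systemctl restart corosync
        sleep 60
    done
    # note: if the probe blocks uninterruptibly in the kernel (as pvesr does in the
    # dmesg trace above), timeout cannot kill it and the loop simply stays stuck
    # here, which also prevents any further corosync restart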
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 7:58 ` Alexandre DERUMIER @ 2020-09-16 8:30 ` Alexandre DERUMIER 2020-09-16 8:53 ` Alexandre DERUMIER [not found] ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com> 0 siblings, 2 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 8:30 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht backtrace of all threads. (I don't have libqb or fuse debug symbol) (gdb) info threads Id Target Id Frame * 1 Thread 0x7fce63d5f900 (LWP 16239) "pmxcfs" 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 2 Thread 0x7fce63ce6700 (LWP 16240) "cfs_loop" 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 3 Thread 0x7fce618c9700 (LWP 16246) "server" 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 4 Thread 0x7fce46beb700 (LWP 30256) "pmxcfs" 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 5 Thread 0x7fce2abf9700 (LWP 43943) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 6 Thread 0x7fce610a6700 (LWP 13346) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 7 Thread 0x7fce2a3f8700 (LWP 8832) "pmxcfs" 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 8 Thread 0x7fce28bf5700 (LWP 3464) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 9 Thread 0x7fce453e8700 (LWP 3727) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 10 Thread 0x7fce2bfff700 (LWP 6705) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 11 Thread 0x7fce293f6700 (LWP 41454) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 12 Thread 0x7fce45be9700 (LWP 17734) "pmxcfs" 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 13 Thread 0x7fce2b7fe700 (LWP 17762) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 14 Thread 0x7fce463ea700 (LWP 2347) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 15 Thread 0x7fce44be7700 (LWP 11335) "pmxcfs" 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) thread 1 [Switching to thread 1 (Thread 0x7fce63d5f900 (LWP 16239))] #0 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #2 0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. 
#4 0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055 ret = -1 lockfd = <optimized out> pipefd = {8, 9} foreground = 0 force_local_mode = 0 wrote_pidfile = 1 memdb = 0x5617f7c563b0 dcdb = 0x5617f8046ca0 status_fsm = 0x5617f806a630 context = <optimized out> entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, { long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, { long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}} err = 0x0 __func__ = "main" utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>} dot = <optimized out> www_data = <optimized out> create = <optimized out> conf_data = 0x5617f80466a0 len = <optimized out> config = <optimized out> bplug = <optimized out> fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0} fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1} fuse_chan = 0x5617f7c560c0 corosync_loop = 0x5617f80481d0 service_quorum = 0x5617f8048460 service_confdb = 0x5617f8069da0 service_dcdb = 0x5617f806a5d0 service_status = 0x5617f806a8e0 (gdb) thread 2 [Switching to thread 2 (Thread 0x7fce63ce6700 (LWP 16240))] #0 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt full #0 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fce67500b2a in ?? () from /lib/x86_64-linux-gnu/libqb.so.0 No symbol table info available. #2 0x00007fce674f1bb0 in qb_loop_run () from /lib/x86_64-linux-gnu/libqb.so.0 No symbol table info available. #3 0x00005617f6d5cd31 in cfs_loop_worker_thread (data=0x5617f80481d0) at loop.c:330 __func__ = "cfs_loop_worker_thread" loop = 0x5617f80481d0 qbloop = 0x5617f8048230 l = <optimized out> ctime = <optimized out> th = 7222815479134420993 #4 0x00007fce67968415 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 No symbol table info available. #5 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #6 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) bt full #0 0x00007fce676497ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fce67500b2a in ?? () from /lib/x86_64-linux-gnu/libqb.so.0 No symbol table info available. #2 0x00007fce674f1bb0 in qb_loop_run () from /lib/x86_64-linux-gnu/libqb.so.0 No symbol table info available. 
#3 0x00005617f6d5e453 in worker_thread (data=<optimized out>) at server.c:529 th = 5863405772435619840 __func__ = "worker_thread" _g_boolean_var_ = <optimized out> #4 0x00007fce67968415 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 No symbol table info available. #5 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #6 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) bt full #0 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 No symbol table info available. #2 0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=5, iov=iov@entry=0x7fce46bea870, len=len@entry=8, rp=rp@entry=0x7fce46bea860) at dfsm.c:339 __func__ = "dfsm_send_message_sync" msgcount = 189465 header = {base = {type = 0, subtype = 5, protocol_version = 1, time = 1600241342, reserved = 0}, count = 189465} real_iov = {{iov_base = 0x7fce46bea7e0, iov_len = 24}, {iov_base = 0x7fce46bea84c, iov_len = 4}, {iov_base = 0x7fce46bea940, iov_len = 4}, {iov_base = 0x7fce46bea858, iov_len = 4}, {iov_base = 0x7fce46bea85c, iov_len = 4}, { iov_base = 0x7fce46bea948, iov_len = 4}, {iov_base = 0x7fce1c002980, iov_len = 34}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} result = CS_OK #3 0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_CREATE, path=0x7fce1c002980 "nodes/m6kvm1/lrm_status.tmp.42080", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, size@entry=0, offset=<optimized out>, flags=<optimized out>) at dcdb.c:157 iov = {{iov_base = 0x7fce46bea84c, iov_len = 4}, {iov_base = 0x7fce46bea940, iov_len = 4}, {iov_base = 0x7fce46bea858, iov_len = 4}, {iov_base = 0x7fce46bea85c, iov_len = 4}, {iov_base = 0x7fce46bea948, iov_len = 4}, { iov_base = 0x7fce1c002980, iov_len = 34}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} pathlen = 34 tolen = 0 rc = {msgcount = 189465, result = -16, processed = 0} #4 0x00005617f6d6b640 in cfs_plug_memdb_create (mode=<optimized out>, fi=<optimized out>, path=<optimized out>, plug=0x5617f8048b70) at cfs-plug-memdb.c:307 res = <optimized out> mdb = 0x5617f8048b70 __func__ = "cfs_plug_memdb_create" res = <optimized out> mdb = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> ctime = <optimized out> #5 cfs_plug_memdb_create (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>, fi=<optimized out>) at cfs-plug-memdb.c:291 res = <optimized out> mdb = <optimized out> __func__ = "cfs_plug_memdb_create" _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> ctime = <optimized out> #6 0x00005617f6d5ca5d in cfs_fuse_create (path=0x7fce1c000bf0 "/nodes/m6kvm1/lrm_status.tmp.42080", mode=33188, fi=0x7fce46beab30) at pmxcfs.c:415 __func__ = "cfs_fuse_create" ret = -13 subpath = 0x7fce1c002980 "nodes/m6kvm1/lrm_status.tmp.42080" plug = <optimized out> #7 0x00007fce678c152b in fuse_fs_create () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #8 0x00007fce678c165a in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #9 0x00007fce678c7bb7 in ?? 
() from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #10 0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #11 0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #12 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #13 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 5 [Switching to thread 5 (Thread 0x7fce2abf9700 (LWP 43943))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. gdb) thread 6 [Switching to thread 6 (Thread 0x7fce610a6700 (LWP 13346))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 7 [Switching to thread 7 (Thread 0x7fce2a3f8700 (LWP 8832))] #0 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt full #0 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 No symbol table info available. 
#2 0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=5, iov=iov@entry=0x7fce2a3f7870, len=len@entry=8, rp=rp@entry=0x7fce2a3f7860) at dfsm.c:339 __func__ = "dfsm_send_message_sync" msgcount = 189466 header = {base = {type = 0, subtype = 5, protocol_version = 1, time = 1600242491, reserved = 0}, count = 189466} real_iov = {{iov_base = 0x7fce2a3f77e0, iov_len = 24}, {iov_base = 0x7fce2a3f784c, iov_len = 4}, {iov_base = 0x7fce2a3f7940, iov_len = 4}, {iov_base = 0x7fce2a3f7858, iov_len = 4}, {iov_base = 0x7fce2a3f785c, iov_len = 4}, { iov_base = 0x7fce2a3f7948, iov_len = 4}, {iov_base = 0x7fce1406f350, iov_len = 10}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} result = CS_OK #3 0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_CREATE, path=0x7fce1406f350 "testalex2", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, size@entry=0, offset=<optimized out>, flags=<optimized out>) at dcdb.c:157 iov = {{iov_base = 0x7fce2a3f784c, iov_len = 4}, {iov_base = 0x7fce2a3f7940, iov_len = 4}, {iov_base = 0x7fce2a3f7858, iov_len = 4}, {iov_base = 0x7fce2a3f785c, iov_len = 4}, {iov_base = 0x7fce2a3f7948, iov_len = 4}, { iov_base = 0x7fce1406f350, iov_len = 10}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} pathlen = 10 tolen = 0 rc = {msgcount = 189466, result = -16, processed = 0} #4 0x00005617f6d6b640 in cfs_plug_memdb_create (mode=<optimized out>, fi=<optimized out>, path=<optimized out>, plug=0x5617f8048b70) at cfs-plug-memdb.c:307 res = <optimized out> mdb = 0x5617f8048b70 __func__ = "cfs_plug_memdb_create" res = <optimized out> mdb = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> ctime = <optimized out> #5 cfs_plug_memdb_create (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>, fi=<optimized out>) at cfs-plug-memdb.c:291 res = <optimized out> mdb = <optimized out> __func__ = "cfs_plug_memdb_create" _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> _g_boolean_var_ = <optimized out> ctime = <optimized out> #6 0x00005617f6d5ca5d in cfs_fuse_create (path=0x7fce14022ff0 "/testalex2", mode=33188, fi=0x7fce2a3f7b30) at pmxcfs.c:415 __func__ = "cfs_fuse_create" ret = -13 subpath = 0x7fce1406f350 "testalex2" plug = <optimized out> #7 0x00007fce678c152b in fuse_fs_create () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #8 0x00007fce678c165a in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #9 0x00007fce678c7bb7 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #10 0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #11 0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #12 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #13 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 8 [Switching to thread 8 (Thread 0x7fce28bf5700 (LWP 3464))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? 
() from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 9 [Switching to thread 9 (Thread 0x7fce453e8700 (LWP 3727))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 10 [Switching to thread 10 (Thread 0x7fce2bfff700 (LWP 6705))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 11 [Switching to thread 11 (Thread 0x7fce293f6700 (LWP 41454))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 12 [Switching to thread 12 (Thread 0x7fce45be9700 (LWP 17734))] #0 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt full #0 0x00007fce67643f59 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fce67989f9f in g_cond_wait () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 No symbol table info available. 
#2 0x00005617f6d67f14 in dfsm_send_message_sync (dfsm=dfsm@entry=0x5617f8046ca0, msgtype=msgtype@entry=2, iov=iov@entry=0x7fce45be88e0, len=len@entry=8, rp=rp@entry=0x7fce45be88d0) at dfsm.c:339 __func__ = "dfsm_send_message_sync" msgcount = 189464 header = {base = {type = 0, subtype = 2, protocol_version = 1, time = 1600241340, reserved = 0}, count = 189464} real_iov = {{iov_base = 0x7fce45be8850, iov_len = 24}, {iov_base = 0x7fce45be88bc, iov_len = 4}, {iov_base = 0x7fce45be89b0, iov_len = 4}, {iov_base = 0x7fce45be88c8, iov_len = 4}, {iov_base = 0x7fce45be88cc, iov_len = 4}, { iov_base = 0x7fce45be89b8, iov_len = 4}, {iov_base = 0x7fce34001dd0, iov_len = 31}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} result = CS_OK #3 0x00005617f6d6491e in dcdb_send_fuse_message (dfsm=0x5617f8046ca0, msg_type=msg_type@entry=DCDB_MESSAGE_CFS_MKDIR, path=0x7fce34001dd0 "priv/lock/file-replication_cfg", to=to@entry=0x0, buf=buf@entry=0x0, size=<optimized out>, size@entry=0, offset=<optimized out>, flags=<optimized out>) at dcdb.c:157 iov = {{iov_base = 0x7fce45be88bc, iov_len = 4}, {iov_base = 0x7fce45be89b0, iov_len = 4}, {iov_base = 0x7fce45be88c8, iov_len = 4}, {iov_base = 0x7fce45be88cc, iov_len = 4}, {iov_base = 0x7fce45be89b8, iov_len = 4}, { iov_base = 0x7fce34001dd0, iov_len = 31}, {iov_base = 0x0, iov_len = 0}, {iov_base = 0x0, iov_len = 0}} pathlen = 31 tolen = 0 rc = {msgcount = 189464, result = -16, processed = 0} #4 0x00005617f6d6b0e7 in cfs_plug_memdb_mkdir (plug=0x5617f8048b70, path=<optimized out>, mode=<optimized out>) at cfs-plug-memdb.c:329 __func__ = "cfs_plug_memdb_mkdir" res = <optimized out> mdb = 0x5617f8048b70 #5 0x00005617f6d5bc70 in cfs_fuse_mkdir (path=0x7fce34002240 "/priv/lock/file-replication_cfg", mode=493) at pmxcfs.c:238 __func__ = "cfs_fuse_mkdir" ret = -13 subpath = 0x7fce34001dd0 "priv/lock/file-replication_cfg" plug = <optimized out> #6 0x00007fce678c2ccb in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #7 0x00007fce678c90b8 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #8 0x00007fce678c5d5c in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #9 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #10 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 13 [Switching to thread 13 (Thread 0x7fce2b7fe700 (LWP 17762))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 14 [Switching to thread 14 (Thread 0x7fce463ea700 (LWP 2347))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? 
() from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. (gdb) thread 15 [Switching to thread 15 (Thread 0x7fce44be7700 (LWP 11335))] #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt full #0 0x00007fce67722544 in read () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce678c56d2 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #2 0x00007fce678c7249 in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678c5cdf in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #4 0x00007fce67718fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #5 0x00007fce676494cf in clone () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mercredi 16 Septembre 2020 09:58:02 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown here a backtrace with pve-cluster-dbgsym installed (gdb) bt full #0 0x00007fce67721896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #1 0x00007fce67721988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0 No symbol table info available. #2 0x00007fce678c5f98 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. #3 0x00007fce678cb577 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2 No symbol table info available. 
#4 0x00005617f6d5ab0e in main (argc=<optimized out>, argv=<optimized out>) at pmxcfs.c:1055 ret = -1 lockfd = <optimized out> pipefd = {8, 9} foreground = 0 force_local_mode = 0 wrote_pidfile = 1 memdb = 0x5617f7c563b0 dcdb = 0x5617f8046ca0 status_fsm = 0x5617f806a630 context = <optimized out> entries = {{long_name = 0x5617f6d7440f "debug", short_name = 100 'd', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x5617f6d82104 <cfs+20>, description = 0x5617f6d742cb "Turn on debug messages", arg_description = 0x0}, { long_name = 0x5617f6d742e2 "foreground", short_name = 102 'f', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc38, description = 0x5617f6d742ed "Do not daemonize server", arg_description = 0x0}, { long_name = 0x5617f6d74305 "local", short_name = 108 'l', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x7ffd672edc3c, description = 0x5617f6d746a0 "Force local mode (ignore corosync.conf, force quorum)", arg_description = 0x0}, {long_name = 0x0, short_name = 0 '\000', flags = 0, arg = G_OPTION_ARG_NONE, arg_data = 0x0, description = 0x0, arg_description = 0x0}} err = 0x0 __func__ = "main" utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "m6kvm1", '\000' <repeats 58 times>, release = "5.4.60-1-pve", '\000' <repeats 52 times>, version = "#1 SMP PVE 5.4.60-1 (Mon, 31 Aug 2020 10:36:22 +0200)", '\000' <repeats 11 times>, machine = "x86_64", '\000' <repeats 58 times>, __domainname = "(none)", '\000' <repeats 58 times>} dot = <optimized out> www_data = <optimized out> create = <optimized out> conf_data = 0x5617f80466a0 len = <optimized out> config = <optimized out> bplug = <optimized out> fa = {0x5617f6d7441e "-f", 0x5617f6d74421 "-odefault_permissions", 0x5617f6d74437 "-oallow_other", 0x0} fuse_args = {argc = 1, argv = 0x5617f8046960, allocated = 1} fuse_chan = 0x5617f7c560c0 corosync_loop = 0x5617f80481d0 service_quorum = 0x5617f8048460 service_confdb = 0x5617f8069da0 service_dcdb = 0x5617f806a5d0 service_status = 0x5617f806a8e0 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mercredi 16 Septembre 2020 09:34:27 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Hi, I have produced the problem now, and this time I don't have restarted corosync a second after the lock of /etc/pve so, currently it's readonly. I don't have used gdb since a long time, could you tell me how to attach to the running pmxcfs process, and some gbd commands to launch ? ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Cc: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Mardi 15 Septembre 2020 17:58:33 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Another small lock at 17:41:09 To be sure, I have done a small loop of write each second in /etc/pve, node node2. it's hanging at first corosync restart, then, on second corosync restart it's working again. 
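A minimal, non-destructive way to attach to the running pmxcfs and capture the kind of per-thread backtraces shown above, assuming gdb and pve-cluster-dbgsym are installed; gdb detaches when it exits, so pmxcfs keeps running:

    # attach to the live pmxcfs, dump all thread backtraces, then detach again
    gdb -p "$(pidof pmxcfs)" -batch \
        -ex 'set pagination off' \
        -ex 'info threads' \
        -ex 'thread apply all bt full' > /var/tmp/pmxcfs-backtraces.txt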
I'll try to improve this tomorrow to be able to debug corosync process - restarting corosync do some write in /etc/pve/ - and if it's hanging don't restart corosync again node2: echo test > /etc/pve/test loop -------------------------------------- Current time : 17:41:01 Current time : 17:41:02 Current time : 17:41:03 Current time : 17:41:04 Current time : 17:41:05 Current time : 17:41:06 Current time : 17:41:07 Current time : 17:41:08 Current time : 17:41:09 hang Current time : 17:42:05 Current time : 17:42:06 Current time : 17:42:07 node1 ----- Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:08 m6kvm1 corosync[18145]: [TOTEM ] Knet pMTU change: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:08 m6kvm1 corosync[18145]: [KNET ] pmtud: Global data MTU changed to: 1397 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-31-zx6KJM/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] lib_init_fn: conn=0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got quorum_type request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] got trackstart request on 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending initial status to 0x556c2918d5f0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QUORUM] sending quorum notification to 0x556c2918d5f0, length = 52 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CMAP ] lib_init_fn: conn=0x556c2918ef20 Sep 15 17:41:09 m6kvm1 pmxcfs[16239]: [status] notice: update cluster info (cluster name m6kvm, version = 20) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-33-6RKbvH/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: 
[MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918ad00, cpd=0x556c2918b50c Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] IPC credentials authenticated (/dev/shm/qb-18145-16239-34-GAY5T9/qb) Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] connecting to client [16239] Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] connection created Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] lib_init_fn: conn=0x556c2918c740, cpd=0x556c2918ce8c Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Creating commit token because I am the rep. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Saving state aru 5 high seq received 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering RECOVERY state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] TRANS [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) ep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [9] member 10: 
Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Did not need to originate any messages in recovery. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Sending initial ORF token Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.90) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.91) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.92) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.93) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.94) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.95) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.96) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.97) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.107) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.108) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.109) Sep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member 
joined: r(0) ip(10.3.94.110) ep 15 17:41:09 m6kvm1 corosync[18145]: [MAIN ] Member joined: r(0) ip(10.3.94.111) Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm1 corosync[18145]: [TOTEM ] A new membership (1.1197) was formed. Members joined: 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] my downlist: members(old:1 left:0) Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) 
ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=2, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=3, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] total_votes=4, expected_votes=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm1 
corosync[18145]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm1 corosync[18145]: [VOTEQ ] got nodeinfo message from cluster node 3 .... .... next corosync restart Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] Node was shut down by a signal Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Unloading all Corosync service engines. Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] withdrawing server sockets Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_unref() - destroying Sep 15 17:42:03 m6kvm1 corosync[18145]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Sep 15 17:42:03 m6kvm1 corosync[18145]: [QB ] qb_ipcs_disconnect(/dev/shm/qb-18145-16239-32-I7ZZ6e/qb) state:2 Sep 15 17:42:03 m6kvm1 pmxcfs[16239]: [confdb] crit: cmap_dispatch failed: 2 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_closed() Sep 15 17:42:03 m6kvm1 corosync[18145]: [CMAP ] exit_fn for conn=0x556c2918ef20 Sep 15 17:42:03 m6kvm1 corosync[18145]: [MAIN ] cs_ipcs_connection_destroyed() node2 ----- Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 10 link: 0 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:05 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 10 link: 0 current link mtu: 1397 Sep 15 17:41:07 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 received pong: 2 Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [TOTEM ] entering GATHER state from 11(merge during join). Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:08 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: Source host 1 not reachable yet. Discarding packet. Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] rx: host: 1 link: 0 is up Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Knet host change callback. nodeid: 1 reachable: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: Starting PMTUD for host: 1 link: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] udp: detected kernel MTU: 1500 Sep 15 17:41:09 m6kvm2 corosync[25411]: [KNET ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Saving state aru 123 high seq received 123 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Storing new sequence id for ring 1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering COMMIT state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] got commit token Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering RECOVERY state. 
Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [0] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [1] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [2] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [3] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [4] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [5] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [6] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [7] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [8] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [9] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [10] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [11] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] TRANS [12] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [0] member 1: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (1.1193) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 5 high delivered 5 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [1] member 2: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [2] member 3: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [3] member 4: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [4] member 5: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [5] member 6: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [6] member 7: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [7] member 8: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [8] member 9: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [9] member 10: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [10] member 11: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [TOTEM ] position [11] member 12: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [12] member 13: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] position [13] member 14: Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] previous ringid (2.1192) Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] aru 123 high delivered 123 received flag 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Did not need to originate any messages in recovery. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] install seq 0 aru 0 high seq received 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] Resetting old ring state Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] recovery to regular 1-0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Member joined: r(0) ip(10.3.94.89) Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] call init for locally known services Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] entering OPERATIONAL state. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] A new membership (1.1197) was formed. 
Members joined: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] enter sync process Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync configuration map access Sep 15 17:41:09 m6kvm2 corosync[25411]: [CMAP ] Not first sync -> no action Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] downlist left_list: 0 received Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got joinlist message from node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] my downlist: members(old:13 left:0) Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[0] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[1] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.110) , pid:30209 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[2] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[3] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.109) , pid:31350 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[4] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[5] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.108) , pid:3569 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[6] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 
corosync[25411]: [CPG ] joinlist_messages[7] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.107) , pid:19504 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[8] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[9] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.97) , pid:11947 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[10] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[11] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.96) , pid:20814 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[12] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[13] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.95) , pid:39420 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[14] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[15] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.94) , pid:12452 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[16] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[17] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.93) , pid:44300 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[18] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[19] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.92) , pid:42259 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[20] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[21] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.91) , pid:40630 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[22] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[23] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.90) , pid:25870 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[24] group:pve_kvstore_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] joinlist_messages[25] group:pve_dcdb_v1\x00, ip:r(0) ip(10.3.94.111) , pid:25634 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending nodelist callback. 
ring_id = 1.1197 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[13]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 13 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[14]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 14 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: No Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: 
[VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 3 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[4]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 4 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[5]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 5 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[6]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA 
Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 6 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[7]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 7 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[8]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 8 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[9]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 9 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[10]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 10 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[11]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 11 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[12]: votes: 1, expected: 14 flags: 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] got nodeinfo message from cluster node 12 Sep 
15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] total_votes=14, expected_votes=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 1 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 3 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 4 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 5 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 6 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 7 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 8 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 9 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 10 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 11 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 12 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 13 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 14 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] node 2 state=1, votes=1, expected=14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] lowest node id: 1 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] highest node id: 14 us: 2 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Sep 15 17:41:09 m6kvm2 corosync[25411]: [QUORUM] sending quorum notification to (nil), length = 104 Sep 15 17:41:09 m6kvm2 corosync[25411]: [VOTEQ ] Sending quorum callback, quorate = 1 Sep 15 17:41:09 m6kvm2 corosync[25411]: [MAIN ] Completed service synchronization, ready to provide service. Sep 15 17:41:09 m6kvm2 corosync[25411]: [TOTEM ] waiting_trans_ack changed to 0 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 Sep 15 17:41:09 m6kvm2 corosync[25411]: [CPG ] got procjoin message from cluster node 1 (r(0) ip(10.3.94.89) ) for pid 16239 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Thomas Lamprecht" <t.lamprecht@proxmox.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:57:46 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>I mean this is bad, but also great! >>Cam you do a coredump of the whole thing and upload it somewhere with the version info >>used (for dbgsym package)? That could help a lot. I'll try to reproduce it again (with the full lock everywhere), and do the coredump. I have tried the real time scheduling, but I still have been able to reproduce the "lrm too long" for 60s (but as I'm restarting corosync each minute, I think it's unlocking something at next corosync restart.) this time it was blocked at the same time on a node in: work { ... } elsif ($state eq 'active') { .... 
$self->update_lrm_status(); and another node in if ($fence_request) { $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); $self->set_local_status({ state => 'lost_agent_lock'}); } elsif (!$self->get_protected_ha_agent_lock()) { $self->set_local_status({ state => 'lost_agent_lock'}); } elsif ($self->{mode} eq 'maintenance') { $self->set_local_status({ state => 'maintenance'}); } ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "aderumier" <aderumier@odiso.com> Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 15 Septembre 2020 16:32:52 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/15/20 4:09 PM, Alexandre DERUMIER wrote: >>> Can you try to give pmxcfs real time scheduling, e.g., by doing: >>> >>> # systemctl edit pve-cluster >>> >>> And then add snippet: >>> >>> >>> [Service] >>> CPUSchedulingPolicy=rr >>> CPUSchedulingPriority=99 > yes, sure, I'll do it now > > >> I'm currently digging the logs >>> Is your most simplest/stable reproducer still a periodic restart of corosync in one node? > yes, a simple "systemctl restart corosync" on 1 node each minute > > > > After 1hour, it's still locked. > > on other nodes, I still have pmxfs logs like: > I mean this is bad, but also great! Cam you do a coredump of the whole thing and upload it somewhere with the version info used (for dbgsym package)? That could help a lot. > manual "pmxcfs -d" > https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e > Hmm, the fuse connection of the previous one got into a weird state (or something is still running) but I'd rather say this is a side-effect not directly connected to the real bug. > > some interesting dmesg about "pvesr" > > [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds. > [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 > [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 > [Tue Sep 15 14:45:34 2020] Call Trace: > [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 > [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 > [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 > [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 > [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 > [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 > [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 > [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 > [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 > [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > hmm, hangs in mkdir (cluster wide locking) _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 8:30 ` Alexandre DERUMIER @ 2020-09-16 8:53 ` Alexandre DERUMIER [not found] ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com> 1 sibling, 0 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 8:53 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht my last mail was too long, so here are the backtraces of pmxcfs as gists: node1-bt -> the node where corosync is restarted each minute -------- https://gist.github.com/aderumier/ed21f22aa6ed9099ec0199255112f6b6 node2-bt -> another node that hangs too -------- https://gist.github.com/aderumier/31fb72b93e77a93fbaec975bc54dfb3a ^ permalink raw reply [flat|nested] 84+ messages in thread
[parent not found: <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com>]
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown [not found] ` <1894376736.864562.1600253445817.JavaMail.zimbra@odiso.com> @ 2020-09-16 13:15 ` Alexandre DERUMIER 2020-09-16 14:45 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 13:15 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht I have reproduced it again, with pmxcfs in debug mode. corosync was restarted at 15:02:10, and /etc/pve was already blocked on the other nodes at 15:02:12. The pmxcfs was still logging after the lock. Here is the log of node1, where corosync was restarted: http://odisoweb1.odiso.net/pmxcfs-corosync.log ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 13:15 ` Alexandre DERUMIER @ 2020-09-16 14:45 ` Thomas Lamprecht 2020-09-16 15:17 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-16 14:45 UTC (permalink / raw) To: Alexandre DERUMIER, Proxmox VE development discussion On 9/16/20 3:15 PM, Alexandre DERUMIER wrote: > I have reproduce it again, with pmxcfs in debug mode > > corosync restart at 15:02:10, and it was already block on other nodes at 15:02:12 > > The pmxcfs was still logging after the lock. > > > here the log on node1 where corosync has been restarted > > http://odisoweb1.odiso.net/pmxcfs-corosync.log > thanks for those, I need a bit to sift through them. Seems like either dfsm gets out of sync or we do not get an ACK reply from cpg_send. A full core dump would still be nice, in gdb: generate-core-file PS: instead of manually switching to threads you can do: thread apply all bt full to get a backtrace for all threads in one command ^ permalink raw reply [flat|nested] 84+ messages in thread
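For anyone who wants to capture the same data, a typical session on the node where pmxcfs hangs could look roughly like the following (the output paths are only examples, not taken from the thread; gdb and the pve-cluster-dbgsym package need to be installed):

gdb -p "$(pidof pmxcfs)"
(gdb) set logging file /root/pmxcfs-bt-full.txt
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) generate-core-file /root/pmxcfs.core
(gdb) set logging off
(gdb) detach
(gdb) quit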
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 14:45 ` Thomas Lamprecht @ 2020-09-16 15:17 ` Alexandre DERUMIER 2020-09-17 9:21 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-16 15:17 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion I have reproduced it again, with the coredump this time restart corosync : 17:05:27 http://odisoweb1.odiso.net/pmxcfs-corosync2.log bt full https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b coredump http://odisoweb1.odiso.net/core.7761.gz ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-16 15:17 ` Alexandre DERUMIER @ 2020-09-17 9:21 ` Fabian Grünbichler 2020-09-17 9:59 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-17 9:21 UTC (permalink / raw) To: Proxmox VE development discussion, Thomas Lamprecht On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote: > I have produce it again, with the coredump this time > > > restart corosync : 17:05:27 > > http://odisoweb1.odiso.net/pmxcfs-corosync2.log > > > bt full > > https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b > > > coredump > > > http://odisoweb1.odiso.net/core.7761.gz just a short update on this: dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg (yet). this looks like either the START_SYNC message to node 13, or the STATE response from it, got lost or was processed wrong. until the mode switches to SYNCED (after all states have been received and the state update went through), regular/normal messages can be sent, but incoming normal messages are queued and not processed. this is why the fuse access blocks: it sends the request out, but the response ends up in the queue. status (the other thing running on top of dfsm) got correctly synced up at the same time, so it's either a dcdb-specific bug, or just bad luck that one was affected and the other wasn't. unfortunately, even with debug enabled, the logs don't contain much information that would help (e.g., we don't log sending/receiving STATE messages except when they look 'wrong'), so Thomas is trying to reproduce this using your scenario here to improve turnaround time. if we can't reproduce it, we'll have to send you patches/patched debs with increased logging to narrow down what is going on. if we can, then we can hopefully find and fix the issue fast. ^ permalink raw reply [flat|nested] 84+ messages in thread
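To make that failure mode easier to follow, here is a minimal, self-contained C sketch of the sync state machine described above. It is illustrative only -- the names and structures are invented and this is not the pmxcfs/dfsm source -- but it shows why a normal message (and the fuse request waiting on its reply) stays queued forever once a single member's STATE message never arrives and the mode therefore never reaches SYNCED:

#include <stdio.h>
#include <stdbool.h>

enum mode { MODE_START_SYNC, MODE_SYNCED };

#define NODES 14

struct cluster_fsm {
    enum mode mode;
    bool state_received[NODES + 1]; /* 1-based node ids */
    int queued;                     /* NORMAL messages parked during sync */
};

static bool all_states_received(const struct cluster_fsm *f)
{
    for (int i = 1; i <= NODES; i++)
        if (!f->state_received[i])
            return false;
    return true;
}

/* a NORMAL message (e.g. a config write and its reply) arrives */
static void deliver_normal(struct cluster_fsm *f)
{
    if (f->mode != MODE_SYNCED) {
        f->queued++;   /* queued, not processed -> the fuse caller keeps waiting */
        return;
    }
    printf("normal message delivered\n");
}

/* a STATE message from one member arrives while we are in START_SYNC */
static void deliver_state(struct cluster_fsm *f, int nodeid)
{
    f->state_received[nodeid] = true;
    if (all_states_received(f)) {
        f->mode = MODE_SYNCED;
        printf("all states received, flushing %d queued message(s)\n", f->queued);
        f->queued = 0;
    }
}

int main(void)
{
    struct cluster_fsm f = { .mode = MODE_START_SYNC };

    /* every member answers the sync request except node 13 */
    for (int i = 1; i <= NODES; i++)
        if (i != 13)
            deliver_state(&f, i);

    deliver_normal(&f); /* the reply to a /etc/pve access: stays queued forever */

    printf("mode=%s, queued=%d\n",
           f.mode == MODE_SYNCED ? "SYNCED" : "START_SYNC", f.queued);
    return 0;
}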
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-17 9:21 ` Fabian Grünbichler @ 2020-09-17 9:59 ` Alexandre DERUMIER 2020-09-17 10:02 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-17 9:59 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht Thanks for the update. >> if >>we can't reproduce it, we'll have to send you patches/patched debs with >>increased logging to narrow down what is going on. if we can, than we >>can hopefully find and fix the issue fast. No problem, I can install the patched deb if needed. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Thomas Lamprecht" <t.lamprecht@proxmox.com> Envoyé: Jeudi 17 Septembre 2020 11:21:45 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote: > I have produce it again, with the coredump this time > > > restart corosync : 17:05:27 > > http://odisoweb1.odiso.net/pmxcfs-corosync2.log > > > bt full > > https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b > > > coredump > > > http://odisoweb1.odiso.net/core.7761.gz just a short update on this: dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg (yet). this looks like either the START_SYNC message to node 13, or the STATE response from it got lost or processed wrong. until the mode switches to SYNCED (after all states have been received and the state update went through), regular/normal messages can be sent, but the incoming normal messages are queued and not processed. this is why the fuse access blocks, it sends the request out, but the response ends up in the queue. status (the other thing running on top of dfsm) got correctly synced up at the same time, so it's either a dcdb specific bug, or just bad luck that one was affected and the other wasn't. unfortunately even with debug enabled the logs don't contain much information that would help (e.g., we don't log sending/receiving STATE messages except when they look 'wrong'), so Thomas is trying to reproduce this using your scenario here to improve turn around time. if we can't reproduce it, we'll have to send you patches/patched debs with increased logging to narrow down what is going on. if we can, than we can hopefully find and fix the issue fast. _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-17 9:59 ` Alexandre DERUMIER @ 2020-09-17 10:02 ` Alexandre DERUMIER 2020-09-17 11:35 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-17 10:02 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht if needed, here is my test script to reproduce it

node1 (restart corosync until node2 doesn't send the timestamp anymore)
-----
#!/bin/bash

for i in `seq 10000`; do
    now=$(date +"%T")
    echo "restart corosync : $now"
    systemctl restart corosync
    for j in {1..59}; do
        last=$(cat /tmp/timestamp)
        curr=`date '+%s'`
        diff=$(($curr - $last))
        if [ $diff -gt 20 ]; then
            echo "too old"
            exit 0
        fi
        sleep 1
    done
done

node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
-----
#!/bin/bash

for i in {1..10000}; do
    now=$(date +"%T")
    echo "Current time : $now"
    curr=`date '+%s'`
    ssh root@node1 "echo $curr > /tmp/timestamp"
    echo "test" > /etc/pve/test
    sleep 1
done

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-17 10:02 ` Alexandre DERUMIER @ 2020-09-17 11:35 ` Thomas Lamprecht 2020-09-20 23:54 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-17 11:35 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: > if needed, here my test script to reproduce it thanks, I'm now using this specific one, had a similar (but all nodes writes) running here since ~ two hours without luck yet, lets see how this behaves. > > node1 (restart corosync until node2 don't send the timestamp anymore) > ----- > > #!/bin/bash > > for i in `seq 10000`; do > now=$(date +"%T") > echo "restart corosync : $now" > systemctl restart corosync > for j in {1..59}; do > last=$(cat /tmp/timestamp) > curr=`date '+%s'` > diff=$(($curr - $last)) > if [ $diff -gt 20 ]; then > echo "too old" > exit 0 > fi > sleep 1 > done > done > > > > node2 (write to /etc/pve/test each second, then send the last timestamp to node1) > ----- > #!/bin/bash > for i in {1..10000}; > do > now=$(date +"%T") > echo "Current time : $now" > curr=`date '+%s'` > ssh root@node1 "echo $curr > /tmp/timestamp" > echo "test" > /etc/pve/test > sleep 1 > done > ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-17 11:35 ` Thomas Lamprecht @ 2020-09-20 23:54 ` Alexandre DERUMIER 2020-09-22 5:43 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-20 23:54 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox VE development discussion Hi, I have done a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s. I was able to reproduce it at corosync stop on node1, 1second later /etc/pve was locked on all other nodes. I have started corosync 10min later on node1, and /etc/pve has become writeable again on all nodes node1: corosync stop: 01:26:50 node2 : /etc/pve locked : 01:26:51 http://odisoweb1.odiso.net/corosync-stop.log pmxcfs : bt full all threads: https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65 pmxcfs: coredump :http://odisoweb1.odiso.net/core.17995.gz node1:corosync start: 01:35:36 http://odisoweb1.odiso.net/corosync-start.log BTW, I have been contacted in pm on the forum by a user following this mailing thread, and he had exactly the same problem with a 7 nodes cluster recently. (shutting down 1 node, /etc/pve was locked until the node was restarted) ----- Mail original ----- De: "Thomas Lamprecht" <t.lamprecht@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderumier@odiso.com> Envoyé: Jeudi 17 Septembre 2020 13:35:55 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: > if needed, here my test script to reproduce it thanks, I'm now using this specific one, had a similar (but all nodes writes) running here since ~ two hours without luck yet, lets see how this behaves. > > node1 (restart corosync until node2 don't send the timestamp anymore) > ----- > > #!/bin/bash > > for i in `seq 10000`; do > now=$(date +"%T") > echo "restart corosync : $now" > systemctl restart corosync > for j in {1..59}; do > last=$(cat /tmp/timestamp) > curr=`date '+%s'` > diff=$(($curr - $last)) > if [ $diff -gt 20 ]; then > echo "too old" > exit 0 > fi > sleep 1 > done > done > > > > node2 (write to /etc/pve/test each second, then send the last timestamp to node1) > ----- > #!/bin/bash > for i in {1..10000}; > do > now=$(date +"%T") > echo "Current time : $now" > curr=`date '+%s'` > ssh root@node1 "echo $curr > /tmp/timestamp" > echo "test" > /etc/pve/test > sleep 1 > done > ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-20 23:54 ` Alexandre DERUMIER @ 2020-09-22 5:43 ` Alexandre DERUMIER 2020-09-24 14:02 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-22 5:43 UTC (permalink / raw) To: Proxmox VE development discussion; +Cc: Thomas Lamprecht I have done a test with "kill -9 <pid of corosync>", and I get around a 20s hang on the other nodes, but after that /etc/pve becomes available again. So it's really something that happens while corosync is in its shutdown phase and pmxcfs is still running. So, for now, as a workaround, I have changed /lib/systemd/system/pve-cluster.service:

#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service

Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. ^ permalink raw reply [flat|nested] 84+ messages in thread
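A small aside on that workaround (a sketch, not something proposed in the thread): editing the shipped unit under /lib/systemd/system can be undone by a package update, so the same ordering change could also be kept as a full local override, for example:

systemctl edit --full pve-cluster
# in the copy that opens (/etc/systemd/system/pve-cluster.service), replace
#   Wants=corosync.service
#   Before=corosync.service
# with
#   Requires=corosync.service
#   After=corosync.service
systemctl daemon-reload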
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-22 5:43 ` Alexandre DERUMIER @ 2020-09-24 14:02 ` Fabian Grünbichler 2020-09-24 14:29 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-24 14:02 UTC (permalink / raw) To: Proxmox VE development discussion On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: > I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, > but after that it's become available again. > > > So, it's really something when corosync is in shutdown phase, and pmxcfs is running. > > So, for now, as workaround, I have changed > > /lib/systemd/system/pve-cluster.service > > #Wants=corosync.service > #Before=corosync.service > Requires=corosync.service > After=corosync.service > > > Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. if you are still able to test, it would be great if you could give the following packages a spin (they only contain some extra debug prints on message processing/sending): http://download.proxmox.com/temp/pmxcfs-dbg/ 64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb ideally, you could get the debug logs from all nodes, and the coredump/bt from the node where pmxcfs hangs. thanks! diff --git a/data/src/dfsm.c b/data/src/dfsm.c index 529c7f9..e0bd93f 100644 --- a/data/src/dfsm.c +++ b/data/src/dfsm.c @@ -162,8 +162,8 @@ static void dfsm_send_sync_message_abort(dfsm_t *dfsm) { g_return_if_fail(dfsm != NULL); - g_mutex_lock (&dfsm->sync_mutex); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); dfsm->msgcount_rcvd = dfsm->msgcount; g_cond_broadcast (&dfsm->sync_cond); g_mutex_unlock (&dfsm->sync_mutex); @@ -181,6 +181,7 @@ dfsm_record_local_result( g_mutex_lock (&dfsm->sync_mutex); dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); + cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); if (rp) { rp->result = msg_result; rp->processed = processed; @@ -235,6 +236,8 @@ dfsm_send_state_message_full( g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); + dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; @@ -317,6 +320,7 @@ dfsm_send_message_sync( for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; + cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len); cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); g_mutex_unlock (&dfsm->sync_mutex); @@ -335,10 +339,12 @@ dfsm_send_message_sync( if (rp) { g_mutex_lock (&dfsm->sync_mutex); - while (dfsm->msgcount_rcvd < msgcount) + while (dfsm->msgcount_rcvd < msgcount) { + cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); + } + cfs_dom_debug(dfsm->log_domain, 
"dfsm_send_message_sync: done waiting for received messages!"); - g_hash_table_remove(dfsm->results, &rp->msgcount); g_mutex_unlock (&dfsm->sync_mutex); @@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback( return; } + cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); + if (base_header->type == DFSM_MESSAGE_NORMAL) { dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", + base_header->type, base_header->subtype, msg_len); if (msg_len < sizeof(dfsm_message_normal_header_t)) { cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", @@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback( } else { int msg_res = -1; + cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", + header->count, base_header->subtype, msg_len); int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), @@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback( */ dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", + base_header->type, base_header->subtype, msg_len, mode); if (msg_len < sizeof(dfsm_message_state_header_t)) { cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", @@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback( if (mode == DFSM_MODE_SYNCED) { if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); for (int i = 0; i < dfsm->sync_info->node_count; i++) dfsm->sync_info->nodes[i].synced = 1; @@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { + cfs_dom_debug(dfsm->log_domain, "received verify request message"); if (msg_len != sizeof(dfsm->csum_counter)) { cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); @@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_START_SYNC) { if (base_header->type == DFSM_MESSAGE_SYNC_START) { - if (nodeid != dfsm->lowest_nodeid) { cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", nodeid, pid); @@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_STATE) { + cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); dfsm_node_info_t *ni; @@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback( goto leave; } + } else { + cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); } return; @@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_UPDATE) { if (base_header->type == DFSM_MESSAGE_UPDATE) { + cfs_dom_debug(dfsm->log_domain, "received update message"); int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); @@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback( } else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); @@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback( size_t 
joined_list_entries) { cs_error_t result; + cfs_debug("dfsm_cpg_confchg_callback called"); dfsm_t *dfsm = NULL; result = cpg_context_get(handle, (gpointer *)&dfsm); ^ permalink raw reply [flat|nested] 84+ messages in thread
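For anyone following along, a rough sequence for deploying the debug build and collecting what is asked for above (a sketch, not from the thread; it assumes the .debs sit directly under the listed directory and that /etc/pve/.debug is available as the usual pmxcfs verbose-logging switch):

# on every node: fetch, verify against the sha512 sums above, install
wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster_6.1-8_amd64.deb
wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster-dbgsym_6.1-8_amd64.deb
sha512sum pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb
dpkg -i pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb
systemctl restart pve-cluster
echo "1" > /etc/pve/.debug      # verbose pmxcfs logging to syslog/journal

# on the node where pmxcfs hangs: full backtrace of all threads plus a core
gdb -p $(pidof pmxcfs) -batch -ex 'set pagination off' \
    -ex 'thread apply all bt full' -ex 'generate-core-file /root/pmxcfs.core'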
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-24 14:02 ` Fabian Grünbichler @ 2020-09-24 14:29 ` Alexandre DERUMIER 2020-09-24 18:07 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-24 14:29 UTC (permalink / raw) To: Proxmox VE development discussion Hi fabian, >>if you are still able to test, it would be great if you could give the >>following packages a spin (they only contain some extra debug prints >>on message processing/sending): Sure, no problem, I'm going to test it tonight. >>ideally, you could get the debug logs from all nodes, and the >>coredump/bt from the node where pmxcfs hangs. thanks! ok,no problem. I'll keep you in touch tomorrow. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:02:04 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: > I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, > but after that it's become available again. > > > So, it's really something when corosync is in shutdown phase, and pmxcfs is running. > > So, for now, as workaround, I have changed > > /lib/systemd/system/pve-cluster.service > > #Wants=corosync.service > #Before=corosync.service > Requires=corosync.service > After=corosync.service > > > Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. if you are still able to test, it would be great if you could give the following packages a spin (they only contain some extra debug prints on message processing/sending): http://download.proxmox.com/temp/pmxcfs-dbg/ 64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb ideally, you could get the debug logs from all nodes, and the coredump/bt from the node where pmxcfs hangs. thanks! 
[... quoted patch trimmed; identical to the one in Fabian's message above ...]
_______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-24 14:29 ` Alexandre DERUMIER @ 2020-09-24 18:07 ` Alexandre DERUMIER 2020-09-25 6:44 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-24 18:07 UTC (permalink / raw) To: Proxmox VE development discussion I was able to reproduce stop corosync on node1 : 18:12:29 /etc/pve locked at 18:12:30 logs of all nodes are here: http://odisoweb1.odiso.net/test1/ I don't have coredump, as my coworker have restarted pmxcfs too fast :/ . Sorry. I'm going to launch another test with coredump this time ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:29:17 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Hi fabian, >>if you are still able to test, it would be great if you could give the >>following packages a spin (they only contain some extra debug prints >>on message processing/sending): Sure, no problem, I'm going to test it tonight. >>ideally, you could get the debug logs from all nodes, and the >>coredump/bt from the node where pmxcfs hangs. thanks! ok,no problem. I'll keep you in touch tomorrow. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:02:04 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: > I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, > but after that it's become available again. > > > So, it's really something when corosync is in shutdown phase, and pmxcfs is running. > > So, for now, as workaround, I have changed > > /lib/systemd/system/pve-cluster.service > > #Wants=corosync.service > #Before=corosync.service > Requires=corosync.service > After=corosync.service > > > Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. if you are still able to test, it would be great if you could give the following packages a spin (they only contain some extra debug prints on message processing/sending): http://download.proxmox.com/temp/pmxcfs-dbg/ 64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb 04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb ideally, you could get the debug logs from all nodes, and the coredump/bt from the node where pmxcfs hangs. thanks! 
[... quoted patch trimmed; identical to the one in Fabian's message above ...]
_______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
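To avoid losing the core again before anyone restarts pmxcfs, a single command run on the hung node is usually enough (assuming gdb is installed; gcore ships with it and only pauses the process briefly):

gcore -o /var/tmp/pmxcfs-hang $(pidof pmxcfs)    # writes /var/tmp/pmxcfs-hang.<pid>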
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-24 18:07 ` Alexandre DERUMIER @ 2020-09-25 6:44 ` Alexandre DERUMIER 2020-09-25 7:15 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-25 6:44 UTC (permalink / raw) To: Proxmox VE development discussion Another test this morning with the coredump available http://odisoweb1.odiso.net/test2/ Something different this time, it has happened on corosync start node1 (corosync start) ------ start corosync : 08:06:56 node2 (/etc/pve locked) ----- Current time : 08:07:01 I had a warning on coredump (gdb) generate-core-file warning: target file /proc/35248/cmdline contained unexpected null characters warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000. Saved corefile core.35248 I hope it's ok. I'll do another test this morning ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 20:07:43 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown I was able to reproduce stop corosync on node1 : 18:12:29 /etc/pve locked at 18:12:30 logs of all nodes are here: http://odisoweb1.odiso.net/test1/ I don't have coredump, as my coworker have restarted pmxcfs too fast :/ . Sorry. I'm going to launch another test with coredump this time ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:29:17 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Hi fabian, >>if you are still able to test, it would be great if you could give the >>following packages a spin (they only contain some extra debug prints >>on message processing/sending): Sure, no problem, I'm going to test it tonight. >>ideally, you could get the debug logs from all nodes, and the >>coredump/bt from the node where pmxcfs hangs. thanks! ok,no problem. I'll keep you in touch tomorrow. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:02:04 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: > I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, > but after that it's become available again. > > > So, it's really something when corosync is in shutdown phase, and pmxcfs is running. > > So, for now, as workaround, I have changed > > /lib/systemd/system/pve-cluster.service > > #Wants=corosync.service > #Before=corosync.service > Requires=corosync.service > After=corosync.service > > > Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 
[... quoted reply and patch trimmed; identical to Fabian's message above ...]
_______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
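Both gdb warnings are very likely harmless here (an interpretation, not confirmed in the thread): /proc/<pid>/cmdline separates arguments with NUL bytes by design, so the "unexpected null characters" message is cosmetic, and 0xffffffffff600000 is the legacy vsyscall page, which is normally not readable and is not needed for pmxcfs backtraces. The saved core should still be usable. The NUL separation is easy to see directly:

tr '\0' ' ' < /proc/$(pidof pmxcfs)/cmdline; echo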
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 6:44 ` Alexandre DERUMIER @ 2020-09-25 7:15 ` Alexandre DERUMIER 2020-09-25 9:19 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-25 7:15 UTC (permalink / raw) To: Proxmox VE development discussion Another hang, this time on corosync stop, coredump available http://odisoweb1.odiso.net/test3/ node1 ---- stop corosync : 09:03:10 node2: /etc/pve locked ------ Current time : 09:03:10 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Vendredi 25 Septembre 2020 08:44:24 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Another test this morning with the coredump available http://odisoweb1.odiso.net/test2/ Something different this time, it has happened on corosync start node1 (corosync start) ------ start corosync : 08:06:56 node2 (/etc/pve locked) ----- Current time : 08:07:01 I had a warning on coredump (gdb) generate-core-file warning: target file /proc/35248/cmdline contained unexpected null characters warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000. Saved corefile core.35248 I hope it's ok. I'll do another test this morning ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 20:07:43 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown I was able to reproduce stop corosync on node1 : 18:12:29 /etc/pve locked at 18:12:30 logs of all nodes are here: http://odisoweb1.odiso.net/test1/ I don't have coredump, as my coworker have restarted pmxcfs too fast :/ . Sorry. I'm going to launch another test with coredump this time ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:29:17 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Hi fabian, >>if you are still able to test, it would be great if you could give the >>following packages a spin (they only contain some extra debug prints >>on message processing/sending): Sure, no problem, I'm going to test it tonight. >>ideally, you could get the debug logs from all nodes, and the >>coredump/bt from the node where pmxcfs hangs. thanks! ok,no problem. I'll keep you in touch tomorrow. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Jeudi 24 Septembre 2020 16:02:04 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 22, 2020 7:43 am, Alexandre DERUMIER wrote: > I have done test with "kill -9 <pidofcorosync", and I have around 20s hang on other nodes, > but after that it's become available again. > > > So, it's really something when corosync is in shutdown phase, and pmxcfs is running. > > So, for now, as workaround, I have changed > > /lib/systemd/system/pve-cluster.service > > #Wants=corosync.service > #Before=corosync.service > Requires=corosync.service > After=corosync.service > > > Like this, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first. 
[... quoted reply and patch trimmed; identical to Fabian's message above ...]
_______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 7:15 ` Alexandre DERUMIER @ 2020-09-25 9:19 ` Fabian Grünbichler 2020-09-25 9:46 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-25 9:19 UTC (permalink / raw) To: Proxmox VE development discussion On September 25, 2020 9:15 am, Alexandre DERUMIER wrote: > > Another hang, this time on corosync stop, coredump available > > http://odisoweb1.odiso.net/test3/ > > > node1 > ---- > stop corosync : 09:03:10 > > node2: /etc/pve locked > ------ > Current time : 09:03:10 thanks, these all indicate the same symptoms: 1. cluster config changes (corosync goes down/comes back up in this case) 2. pmxcfs starts sync process 3. all (online) nodes receive sync request for dcdb and status 4. all nodes send state for dcdb and status via CPG 5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3) in step 5, there is no trace of the message on the receiving side, even though the sending node does not log an error. as before, the hang is just a side-effect of the state machine ending up in a state that should be short-lived (syncing, waiting for state from all nodes) with no progress. the code and theory say that this should not happen, as either sending the state fails triggering the node to leave the CPG (restarting the sync), or a node drops out of quorum (triggering a config change, which triggers restarting the sync), or we get all states from all nodes and the sync proceeds. this looks to me like a fundamental assumption/guarantee does not hold.. I will rebuild once more modifying the send code a bit to log a lot more details when sending state messages, it would be great if you could repeat with that as we are still unable to reproduce the issue. hopefully those logs will then indicate whether this is a corosync/knet bug, or if the issue is in our state machine code somewhere. so far it looks more like the former.. ^ permalink raw reply [flat|nested] 84+ messages in thread
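While a node sits in that stuck sync state, it may also be worth capturing corosync's own view of membership on every node, to see whether the member whose state message never shows up is still present at the totem/CPG level (a sketch using the standard corosync CLI tools; pmxcfs keeps separate CPG groups for dcdb and status):

corosync-quorumtool -s     # quorum state and member list
corosync-cfgtool -s        # knet link status towards every other node
corosync-cpgtool           # CPG groups and the node/pid of each member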
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 9:19 ` Fabian Grünbichler @ 2020-09-25 9:46 ` Alexandre DERUMIER 2020-09-25 12:51 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-25 9:46 UTC (permalink / raw) To: Proxmox VE development discussion >>I will rebuild once more modifying the send code a bit to log a lot more >>details when sending state messages, it would be great if you could >>repeat with that as we are still unable to reproduce the issue. ok, no problem, I'm able to easily reproduce it, I'll do new test when you'll send the new version. (and thanks again to debugging this, because It's really beyond my competence) ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Vendredi 25 Septembre 2020 11:19:04 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 25, 2020 9:15 am, Alexandre DERUMIER wrote: > > Another hang, this time on corosync stop, coredump available > > http://odisoweb1.odiso.net/test3/ > > > node1 > ---- > stop corosync : 09:03:10 > > node2: /etc/pve locked > ------ > Current time : 09:03:10 thanks, these all indicate the same symptoms: 1. cluster config changes (corosync goes down/comes back up in this case) 2. pmxcfs starts sync process 3. all (online) nodes receive sync request for dcdb and status 4. all nodes send state for dcdb and status via CPG 5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3) in step 5, there is no trace of the message on the receiving side, even though the sending node does not log an error. as before, the hang is just a side-effect of the state machine ending up in a state that should be short-lived (syncing, waiting for state from all nodes) with no progress. the code and theory say that this should not happen, as either sending the state fails triggering the node to leave the CPG (restarting the sync), or a node drops out of quorum (triggering a config change, which triggers restarting the sync), or we get all states from all nodes and the sync proceeds. this looks to me like a fundamental assumption/guarantee does not hold.. I will rebuild once more modifying the send code a bit to log a lot more details when sending state messages, it would be great if you could repeat with that as we are still unable to reproduce the issue. hopefully those logs will then indicate whether this is a corosync/knet bug, or if the issue is in our state machine code somewhere. so far it looks more like the former.. _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 9:46 ` Alexandre DERUMIER @ 2020-09-25 12:51 ` Fabian Grünbichler 2020-09-25 16:29 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-25 12:51 UTC (permalink / raw) To: Proxmox VE development discussion On September 25, 2020 11:46 am, Alexandre DERUMIER wrote: >>>I will rebuild once more modifying the send code a bit to log a lot more >>>details when sending state messages, it would be great if you could >>>repeat with that as we are still unable to reproduce the issue. > > ok, no problem, I'm able to easily reproduce it, I'll do new test when you'll > send the new version. > > (and thanks again to debugging this, because It's really beyond my competence) same procedure as last time, same place, new checksums: 6b5e2defe543a874e0f7a883e40b279c997438dde158566c4c93c11ea531aef924ed2e4eb2506b5064e49ec9bdd4ebe7acd0b9278e9286eac527b0b15a43d8d7 pve-cluster_6.1-8_amd64.deb d57ddc08824055826ee15c9c255690d9140e43f8d5164949108f0dc483a2d181b2bda76f0e7f47202699a062342c0cf0bba8f2ae0f7c5411af9967ef051050a0 pve-cluster-dbgsym_6.1-8_amd64.deb I found one (unfortunately unrelated) bug in our error handling, so there's that at least ;) diff --git a/data/src/dfsm.c b/data/src/dfsm.c index 529c7f9..f3397a0 100644 --- a/data/src/dfsm.c +++ b/data/src/dfsm.c @@ -162,8 +162,8 @@ static void dfsm_send_sync_message_abort(dfsm_t *dfsm) { g_return_if_fail(dfsm != NULL); - g_mutex_lock (&dfsm->sync_mutex); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); dfsm->msgcount_rcvd = dfsm->msgcount; g_cond_broadcast (&dfsm->sync_cond); g_mutex_unlock (&dfsm->sync_mutex); @@ -181,6 +181,7 @@ dfsm_record_local_result( g_mutex_lock (&dfsm->sync_mutex); dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); + cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); if (rp) { rp->result = msg_result; rp->processed = processed; @@ -224,6 +225,48 @@ loop: return result; } +static cs_error_t +dfsm_send_message_full_debug_state( + dfsm_t *dfsm, + struct iovec *iov, + unsigned int len, + int retry) +{ + g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); + g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); + + struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; + cs_error_t result; + int retries = 0; + cfs_dom_message(dfsm->log_domain, "send state message debug"); + cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); + for (int i = 0; i < len; i++) + cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); +loop: + cfs_dom_message(dfsm->log_domain, "send state message loop body"); + + result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); + + cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); + if (retry && result == CS_ERR_TRY_AGAIN) { + nanosleep(&tvreq, NULL); + ++retries; + if ((retries % 10) == 0) + cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); + if (retries < 100) + goto loop; + } + + if (retries) + cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); + + if (result != CS_OK && + (!retry || result != CS_ERR_TRY_AGAIN)) + cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); + + return result; +} + static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, @@ 
-235,6 +278,8 @@ dfsm_send_state_message_full( g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); + dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; @@ -252,7 +297,7 @@ dfsm_send_state_message_full( for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; - return dfsm_send_message_full(dfsm, real_iov, len + 1, 1); + return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } cs_error_t @@ -317,6 +362,7 @@ dfsm_send_message_sync( for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; + cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, count %" PRIu64 ", msgtype %d, len %d", msgcount, msgtype, len); cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); g_mutex_unlock (&dfsm->sync_mutex); @@ -335,10 +381,12 @@ dfsm_send_message_sync( if (rp) { g_mutex_lock (&dfsm->sync_mutex); - while (dfsm->msgcount_rcvd < msgcount) + while (dfsm->msgcount_rcvd < msgcount) { + cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); + } + cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); - g_hash_table_remove(dfsm->results, &rp->msgcount); g_mutex_unlock (&dfsm->sync_mutex); @@ -685,9 +733,13 @@ dfsm_cpg_deliver_callback( return; } + cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); + if (base_header->type == DFSM_MESSAGE_NORMAL) { dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", + base_header->type, base_header->subtype, msg_len); if (msg_len < sizeof(dfsm_message_normal_header_t)) { cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", @@ -704,6 +756,8 @@ dfsm_cpg_deliver_callback( } else { int msg_res = -1; + cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", + header->count, base_header->subtype, msg_len); int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), @@ -724,6 +778,8 @@ dfsm_cpg_deliver_callback( */ dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", + base_header->type, base_header->subtype, msg_len, mode); if (msg_len < sizeof(dfsm_message_state_header_t)) { cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", @@ -744,6 +800,7 @@ dfsm_cpg_deliver_callback( if (mode == DFSM_MODE_SYNCED) { if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); for (int i = 0; i < dfsm->sync_info->node_count; i++) dfsm->sync_info->nodes[i].synced = 1; @@ -754,6 +811,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { + cfs_dom_debug(dfsm->log_domain, "received verify request message"); if (msg_len != 
sizeof(dfsm->csum_counter)) { cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); @@ -823,7 +881,6 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_START_SYNC) { if (base_header->type == DFSM_MESSAGE_SYNC_START) { - if (nodeid != dfsm->lowest_nodeid) { cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", nodeid, pid); @@ -861,6 +918,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_STATE) { + cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); dfsm_node_info_t *ni; @@ -906,6 +964,8 @@ dfsm_cpg_deliver_callback( goto leave; } + } else { + cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); } return; @@ -914,6 +974,7 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_UPDATE) { if (base_header->type == DFSM_MESSAGE_UPDATE) { + cfs_dom_debug(dfsm->log_domain, "received update message"); int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); @@ -925,6 +986,7 @@ dfsm_cpg_deliver_callback( } else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); @@ -1104,6 +1166,7 @@ dfsm_cpg_confchg_callback( size_t joined_list_entries) { cs_error_t result; + cfs_debug("dfsm_cpg_confchg_callback called"); dfsm_t *dfsm = NULL; result = cpg_context_get(handle, (gpointer *)&dfsm); @@ -1190,7 +1253,7 @@ dfsm_cpg_confchg_callback( dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC); if (lowest_nodeid == dfsm->nodeid) { - if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) { + if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) { cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message"); goto leave; } ^ permalink raw reply [flat|nested] 84+ messages in thread
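A side note on the error-handling fix in the last hunk: corosync's cs_error_t uses CS_OK = 1 for success and other nonzero values for errors (at least in the corotypes.h shipped with corosync 3.x), so the old "if (!dfsm_send_state_message_full(...))" check could not become true for a real error code either, and a failed SYNC_START send went unreported; comparing against CS_OK makes the critical-log path reachable. If the corosync development headers are installed, the constant is easy to check:

grep -n 'CS_OK' /usr/include/corosync/corotypes.h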
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 12:51 ` Fabian Grünbichler @ 2020-09-25 16:29 ` Alexandre DERUMIER 2020-09-28 9:17 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-25 16:29 UTC (permalink / raw) To: Proxmox VE development discussion here a new hang: http://odisoweb1.odiso.net/test4/ This time on corosync start. node1: ----- start corosync : 17:22:02 node2 ----- /etc/pve locked 17:22:07 Something new: where doing coredump or bt-full on pmxcfs on node1, this have unlocked /etc/pve on other nodes /etc/pve unlocked(with coredump or bt-full): 17:57:40 ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Vendredi 25 Septembre 2020 14:51:30 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 25, 2020 11:46 am, Alexandre DERUMIER wrote: >>>I will rebuild once more modifying the send code a bit to log a lot more >>>details when sending state messages, it would be great if you could >>>repeat with that as we are still unable to reproduce the issue. > > ok, no problem, I'm able to easily reproduce it, I'll do new test when you'll > send the new version. > > (and thanks again to debugging this, because It's really beyond my competence) same procedure as last time, same place, new checksums: 6b5e2defe543a874e0f7a883e40b279c997438dde158566c4c93c11ea531aef924ed2e4eb2506b5064e49ec9bdd4ebe7acd0b9278e9286eac527b0b15a43d8d7 pve-cluster_6.1-8_amd64.deb d57ddc08824055826ee15c9c255690d9140e43f8d5164949108f0dc483a2d181b2bda76f0e7f47202699a062342c0cf0bba8f2ae0f7c5411af9967ef051050a0 pve-cluster-dbgsym_6.1-8_amd64.deb I found one (unfortunately unrelated) bug in our error handling, so there's that at least ;) diff --git a/data/src/dfsm.c b/data/src/dfsm.c index 529c7f9..f3397a0 100644 --- a/data/src/dfsm.c +++ b/data/src/dfsm.c @@ -162,8 +162,8 @@ static void dfsm_send_sync_message_abort(dfsm_t *dfsm) { g_return_if_fail(dfsm != NULL); - g_mutex_lock (&dfsm->sync_mutex); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount); dfsm->msgcount_rcvd = dfsm->msgcount; g_cond_broadcast (&dfsm->sync_cond); g_mutex_unlock (&dfsm->sync_mutex); @@ -181,6 +181,7 @@ dfsm_record_local_result( g_mutex_lock (&dfsm->sync_mutex); dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count); + cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result); if (rp) { rp->result = msg_result; rp->processed = processed; @@ -224,6 +225,48 @@ loop: return result; } +static cs_error_t +dfsm_send_message_full_debug_state( + dfsm_t *dfsm, + struct iovec *iov, + unsigned int len, + int retry) +{ + g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); + g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); + + struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; + cs_error_t result; + int retries = 0; + cfs_dom_message(dfsm->log_domain, "send state message debug"); + cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); + for (int i = 0; i < len; i++) + cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); +loop: + cfs_dom_message(dfsm->log_domain, "send state message loop body"); + + result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); + + cfs_dom_message(dfsm->log_domain, "send state 
message result: %d", result); + if (retry && result == CS_ERR_TRY_AGAIN) { + nanosleep(&tvreq, NULL); + ++retries; + if ((retries % 10) == 0) + cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); + if (retries < 100) + goto loop; + } + + if (retries) + cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); + + if (result != CS_OK && + (!retry || result != CS_ERR_TRY_AGAIN)) + cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); + + return result; +} + static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, @@ -235,6 +278,8 @@ dfsm_send_state_message_full( g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); + cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); + dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; @@ -252,7 +297,7 @@ dfsm_send_state_message_full( for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; - return dfsm_send_message_full(dfsm, real_iov, len + 1, 1); + return type == DFSM_MESSAGE_STATE ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } cs_error_t @@ -317,6 +362,7 @@ dfsm_send_message_sync( for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; + cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, count %" PRIu64 ", msgtype %d, len %d", msgcount, msgtype, len); cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1); g_mutex_unlock (&dfsm->sync_mutex); @@ -335,10 +381,12 @@ dfsm_send_message_sync( if (rp) { g_mutex_lock (&dfsm->sync_mutex); - while (dfsm->msgcount_rcvd < msgcount) + while (dfsm->msgcount_rcvd < msgcount) { + cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount); g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex); + } + cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!"); - g_hash_table_remove(dfsm->results, &rp->msgcount); g_mutex_unlock (&dfsm->sync_mutex); @@ -685,9 +733,13 @@ dfsm_cpg_deliver_callback( return; } + cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type); + if (base_header->type == DFSM_MESSAGE_NORMAL) { dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)", + base_header->type, base_header->subtype, msg_len); if (msg_len < sizeof(dfsm_message_normal_header_t)) { cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)", @@ -704,6 +756,8 @@ dfsm_cpg_deliver_callback( } else { int msg_res = -1; + cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)", + header->count, base_header->subtype, msg_len); int res = dfsm->dfsm_callbacks->dfsm_deliver_fn( dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype, base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t), @@ -724,6 +778,8 @@ dfsm_cpg_deliver_callback( */ dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg; + cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d", + base_header->type, base_header->subtype, msg_len, mode); if (msg_len < sizeof(dfsm_message_state_header_t)) { 
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)", @@ -744,6 +800,7 @@ dfsm_cpg_deliver_callback( if (mode == DFSM_MODE_SYNCED) { if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); for (int i = 0; i < dfsm->sync_info->node_count; i++) dfsm->sync_info->nodes[i].synced = 1; @@ -754,6 +811,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) { + cfs_dom_debug(dfsm->log_domain, "received verify request message"); if (msg_len != sizeof(dfsm->csum_counter)) { cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid); @@ -823,7 +881,6 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_START_SYNC) { if (base_header->type == DFSM_MESSAGE_SYNC_START) { - if (nodeid != dfsm->lowest_nodeid) { cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d", nodeid, pid); @@ -861,6 +918,7 @@ dfsm_cpg_deliver_callback( return; } else if (base_header->type == DFSM_MESSAGE_STATE) { + cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid); dfsm_node_info_t *ni; @@ -906,6 +964,8 @@ dfsm_cpg_deliver_callback( goto leave; } + } else { + cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more"); } return; @@ -914,6 +974,7 @@ dfsm_cpg_deliver_callback( } else if (mode == DFSM_MODE_UPDATE) { if (base_header->type == DFSM_MESSAGE_UPDATE) { + cfs_dom_debug(dfsm->log_domain, "received update message"); int res = dfsm->dfsm_callbacks->dfsm_process_update_fn( dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len); @@ -925,6 +986,7 @@ dfsm_cpg_deliver_callback( } else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) { + cfs_dom_debug(dfsm->log_domain, "received update complete message"); int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info); @@ -1104,6 +1166,7 @@ dfsm_cpg_confchg_callback( size_t joined_list_entries) { cs_error_t result; + cfs_debug("dfsm_cpg_confchg_callback called"); dfsm_t *dfsm = NULL; result = cpg_context_get(handle, (gpointer *)&dfsm); @@ -1190,7 +1253,7 @@ dfsm_cpg_confchg_callback( dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC); if (lowest_nodeid == dfsm->nodeid) { - if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) { + if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) { cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message"); goto leave; } _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
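A minimal sketch of the unrelated error-handling bug fixed in the last hunk of the diff above: corosync's cs_error_t numbering starts at CS_OK == 1, so the success value and every error code are all non-zero, and the old check "if (!dfsm_send_state_message_full(...))" could never fire. The enum values below only mirror corotypes.h for illustration; everything else is a stand-alone toy, not pve-cluster code.

/* toy illustration of the SYNC_START error check, not pve-cluster code */
#include <stdio.h>

typedef enum { CS_OK = 1, CS_ERR_LIBRARY = 2, CS_ERR_TRY_AGAIN = 6 } cs_error_t;

static cs_error_t send_sync_start(int simulate_failure)
{
    return simulate_failure ? CS_ERR_LIBRARY : CS_OK;
}

int main(void)
{
    for (int fail = 0; fail <= 1; fail++) {
        cs_error_t res = send_sync_start(fail);
        if (!res)           /* old check: always false, send failures stayed silent */
            printf("old check: failed to send SYNC_START message\n");
        if (res != CS_OK)   /* fixed check: fires exactly when sending failed */
            printf("new check: failed to send SYNC_START message (%d)\n", res);
    }
    return 0;
}

Running it, only the corrected comparison reports the simulated failure, which is why the original condition silently swallowed a failed SYNC_START send.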
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-25 16:29 ` Alexandre DERUMIER @ 2020-09-28 9:17 ` Fabian Grünbichler 2020-09-28 9:35 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-28 9:17 UTC (permalink / raw) To: Proxmox VE development discussion On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: > here a new hang: > > http://odisoweb1.odiso.net/test4/ okay, so at least we now know something strange inside pmxcfs is going on, and not inside corosync - we never reach the part where the broken node (again #13 in this hang!) sends out the state message: Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) [...] Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) this should be followed by output like this (from the -unblock log, where the sync went through): Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) but this never happens. the relevant code from our patched dfsm.c (reordered): static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, uint16_t type, struct iovec *iov, unsigned int len) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); // this message is still logged cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); // everything below this point might not have happened anymore dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; header.base.protocol_version = dfsm->protocol_version; header.base.time = time(NULL); header.base.reserved = 0; header.epoch = dfsm->sync_epoch; struct iovec real_iov[len + 1]; real_iov[0].iov_base = (char *)&header; real_iov[0].iov_len = sizeof(header); for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; return type == DFSM_MESSAGE_STATE ? 
dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } static cs_error_t dfsm_send_message_full_debug_state( dfsm_t *dfsm, struct iovec *iov, unsigned int len, int retry) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; cs_error_t result; int retries = 0; // this message is not visible in log // we don't know how far above this we managed to run cfs_dom_message(dfsm->log_domain, "send state message debug"); cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); for (int i = 0; i < len; i++) cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); loop: cfs_dom_message(dfsm->log_domain, "send state message loop body"); result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); if (retry && result == CS_ERR_TRY_AGAIN) { nanosleep(&tvreq, NULL); ++retries; if ((retries % 10) == 0) cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); if (retries < 100) goto loop; } if (retries) cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); if (result != CS_OK && (!retry || result != CS_ERR_TRY_AGAIN)) cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); return result; } I don't see much that could go wrong inside dfsm_send_state_message_full after the debug log statement (it's just filling out the header and iov structs, and calling 'dfsm_send_message_full_debug_state' since type == 2 == DFSM_MESSAGE_STATE). inside dfsm_send_message_full_debug, before the first log statement we only check for dfsm or the message length/content being NULL/0, all of which can't really happen with that call path. also, in that case we'd return CS_ERR_INVALID_PARAM , which would bubble up into the delivery callback and cause us to leave CPG, which would again be visible in the logs.. but, just to make sure, could you reproduce the issue once more, and then (with debug symbols installed) run $ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) on all nodes at the same time? this should minimize the fallout and show us whether the thread that logged the first part of sending the state is still around on the node that triggers the hang.. > Something new: where doing coredump or bt-full on pmxcfs on node1, > this have unlocked /etc/pve on other nodes > > /etc/pve unlocked(with coredump or bt-full): 17:57:40 this looks like you bt-ed corosync this time around? if so, than this is expected: - attach gdb to corosync - corosync blocks - other nodes notice corosync is gone on node X - config change triggered - sync restarts on all nodes, does not trigger bug this time ^ permalink raw reply [flat|nested] 84+ messages in thread
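A short side note on the precondition checks discussed above, as a sketch that assumes only standard glib behaviour: g_return_val_if_fail() logs a critical message and returns the supplied value when its expression is false, so a bad iov/len at the top of dfsm_send_message_full_debug_state would show up both in the logs and as an error code in the caller, rather than as the silent gap seen here. The numeric constants below are illustrative stand-ins, not copied from corotypes.h.

/* sketch only: glib precondition semantics, not pmxcfs code */
#include <sys/uio.h>
#include <stdio.h>
#include <glib.h>

typedef int cs_error_t;
enum { CS_OK = 1, CS_ERR_INVALID_PARAM = 7 };   /* illustrative values */

static cs_error_t send_state(const struct iovec *iov, unsigned int len)
{
    /* same shape as the checks at the top of the real function */
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
    /* ... fill header, cpg_mcast_joined(), retry on TRY_AGAIN ... */
    return CS_OK;
}

int main(void)
{
    printf("len=0, iov=NULL -> %d (allowed)\n", send_state(NULL, 0));
    printf("len=3, iov=NULL -> %d (precondition trips, critical logged)\n",
           send_state(NULL, 3));
    return 0;
}

Built against glib-2.0, the failing call prints a CRITICAL line on stderr and returns the error value; it cannot hang inside the macro, which supports the reasoning that the hang happens somewhere else.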
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-28 9:17 ` Fabian Grünbichler @ 2020-09-28 9:35 ` Alexandre DERUMIER 2020-09-28 15:59 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-28 9:35 UTC (permalink / raw) To: Proxmox VE development discussion >>but, just to make sure, could you reproduce the issue once more, and >>then (with debug symbols installed) run >> >>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) >> >>on all nodes at the same time? this should minimize the fallout and show >>us whether the thread that logged the first part of sending the state is >>still around on the node that triggers the hang.. ok, no problem, I'll do a new test today >>this looks like you bt-ed corosync this time around? if so, than this is >>expected: >> >>- attach gdb to corosync >>- corosync blocks >>- other nodes notice corosync is gone on node X >>- config change triggered >>- sync restarts on all nodes, does not trigger bug this time ok, thanks, make sense. I didn't notice this previously, but this time it was with corosync started. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 11:17:37 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: > here a new hang: > > http://odisoweb1.odiso.net/test4/ okay, so at least we now know something strange inside pmxcfs is going on, and not inside corosync - we never reach the part where the broken node (again #13 in this hang!) sends out the state message: Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) [...] 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) this should be followed by output like this (from the -unblock log, where the sync went through): Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) but this never happens. the relevant code from our patched dfsm.c (reordered): static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, uint16_t type, struct iovec *iov, unsigned int len) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); // this message is still logged cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); // everything below this point might not have happened anymore dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; header.base.protocol_version = dfsm->protocol_version; header.base.time = time(NULL); header.base.reserved = 0; header.epoch = dfsm->sync_epoch; struct iovec real_iov[len + 1]; real_iov[0].iov_base = (char *)&header; real_iov[0].iov_len = sizeof(header); for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; return type == DFSM_MESSAGE_STATE ? 
dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } static cs_error_t dfsm_send_message_full_debug_state( dfsm_t *dfsm, struct iovec *iov, unsigned int len, int retry) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; cs_error_t result; int retries = 0; // this message is not visible in log // we don't know how far above this we managed to run cfs_dom_message(dfsm->log_domain, "send state message debug"); cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); for (int i = 0; i < len; i++) cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); loop: cfs_dom_message(dfsm->log_domain, "send state message loop body"); result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); if (retry && result == CS_ERR_TRY_AGAIN) { nanosleep(&tvreq, NULL); ++retries; if ((retries % 10) == 0) cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); if (retries < 100) goto loop; } if (retries) cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); if (result != CS_OK && (!retry || result != CS_ERR_TRY_AGAIN)) cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); return result; } I don't see much that could go wrong inside dfsm_send_state_message_full after the debug log statement (it's just filling out the header and iov structs, and calling 'dfsm_send_message_full_debug_state' since type == 2 == DFSM_MESSAGE_STATE). inside dfsm_send_message_full_debug, before the first log statement we only check for dfsm or the message length/content being NULL/0, all of which can't really happen with that call path. also, in that case we'd return CS_ERR_INVALID_PARAM , which would bubble up into the delivery callback and cause us to leave CPG, which would again be visible in the logs.. but, just to make sure, could you reproduce the issue once more, and then (with debug symbols installed) run $ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) on all nodes at the same time? this should minimize the fallout and show us whether the thread that logged the first part of sending the state is still around on the node that triggers the hang.. > Something new: where doing coredump or bt-full on pmxcfs on node1, > this have unlocked /etc/pve on other nodes > > /etc/pve unlocked(with coredump or bt-full): 17:57:40 this looks like you bt-ed corosync this time around? if so, than this is expected: - attach gdb to corosync - corosync blocks - other nodes notice corosync is gone on node X - config change triggered - sync restarts on all nodes, does not trigger bug this time _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-28 9:35 ` Alexandre DERUMIER @ 2020-09-28 15:59 ` Alexandre DERUMIER 2020-09-29 5:30 ` Alexandre DERUMIER 2020-09-29 8:51 ` Fabian Grünbichler 0 siblings, 2 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-28 15:59 UTC (permalink / raw) To: Proxmox VE development discussion Here a new test http://odisoweb1.odiso.net/test5 This has occured at corosync start node1: ----- start corosync : 17:30:19 node2: /etc/pve locked -------------- Current time : 17:30:24 I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 and a coredump of all nodes at same time with parallel ssh at 17:42:26 (Note that this time, /etc/pve was still locked after backtrace/coredump) ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 11:35:00 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>but, just to make sure, could you reproduce the issue once more, and >>then (with debug symbols installed) run >> >>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) >> >>on all nodes at the same time? this should minimize the fallout and show >>us whether the thread that logged the first part of sending the state is >>still around on the node that triggers the hang.. ok, no problem, I'll do a new test today >>this looks like you bt-ed corosync this time around? if so, than this is >>expected: >> >>- attach gdb to corosync >>- corosync blocks >>- other nodes notice corosync is gone on node X >>- config change triggered >>- sync restarts on all nodes, does not trigger bug this time ok, thanks, make sense. I didn't notice this previously, but this time it was with corosync started. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 11:17:37 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: > here a new hang: > > http://odisoweb1.odiso.net/test4/ okay, so at least we now know something strange inside pmxcfs is going on, and not inside corosync - we never reach the part where the broken node (again #13 in this hang!) sends out the state message: Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) [...] 
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) this should be followed by output like this (from the -unblock log, where the sync went through): Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) but this never happens. the relevant code from our patched dfsm.c (reordered): static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, uint16_t type, struct iovec *iov, unsigned int len) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); // this message is still logged cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); // everything below this point might not have happened anymore dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; header.base.protocol_version = dfsm->protocol_version; header.base.time = time(NULL); header.base.reserved = 0; header.epoch = dfsm->sync_epoch; struct iovec real_iov[len + 1]; real_iov[0].iov_base = (char *)&header; real_iov[0].iov_len = sizeof(header); for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; return type == DFSM_MESSAGE_STATE ? 
dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } static cs_error_t dfsm_send_message_full_debug_state( dfsm_t *dfsm, struct iovec *iov, unsigned int len, int retry) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; cs_error_t result; int retries = 0; // this message is not visible in log // we don't know how far above this we managed to run cfs_dom_message(dfsm->log_domain, "send state message debug"); cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); for (int i = 0; i < len; i++) cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); loop: cfs_dom_message(dfsm->log_domain, "send state message loop body"); result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); if (retry && result == CS_ERR_TRY_AGAIN) { nanosleep(&tvreq, NULL); ++retries; if ((retries % 10) == 0) cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); if (retries < 100) goto loop; } if (retries) cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); if (result != CS_OK && (!retry || result != CS_ERR_TRY_AGAIN)) cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); return result; } I don't see much that could go wrong inside dfsm_send_state_message_full after the debug log statement (it's just filling out the header and iov structs, and calling 'dfsm_send_message_full_debug_state' since type == 2 == DFSM_MESSAGE_STATE). inside dfsm_send_message_full_debug, before the first log statement we only check for dfsm or the message length/content being NULL/0, all of which can't really happen with that call path. also, in that case we'd return CS_ERR_INVALID_PARAM , which would bubble up into the delivery callback and cause us to leave CPG, which would again be visible in the logs.. but, just to make sure, could you reproduce the issue once more, and then (with debug symbols installed) run $ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) on all nodes at the same time? this should minimize the fallout and show us whether the thread that logged the first part of sending the state is still around on the node that triggers the hang.. > Something new: where doing coredump or bt-full on pmxcfs on node1, > this have unlocked /etc/pve on other nodes > > /etc/pve unlocked(with coredump or bt-full): 17:57:40 this looks like you bt-ed corosync this time around? if so, than this is expected: - attach gdb to corosync - corosync blocks - other nodes notice corosync is gone on node X - config change triggered - sync restarts on all nodes, does not trigger bug this time _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-28 15:59 ` Alexandre DERUMIER @ 2020-09-29 5:30 ` Alexandre DERUMIER 2020-09-29 8:51 ` Fabian Grünbichler 1 sibling, 0 replies; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 5:30 UTC (permalink / raw) To: Proxmox VE development discussion also for test5, I have restarted corosync on node1 at 17:54:05, this have unlocked /etc/pve on other nodes I have submitted logs too : "corosync-restart-nodeX.log" ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 17:59:20 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown Here a new test http://odisoweb1.odiso.net/test5 This has occured at corosync start node1: ----- start corosync : 17:30:19 node2: /etc/pve locked -------------- Current time : 17:30:24 I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 and a coredump of all nodes at same time with parallel ssh at 17:42:26 (Note that this time, /etc/pve was still locked after backtrace/coredump) ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 11:35:00 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>but, just to make sure, could you reproduce the issue once more, and >>then (with debug symbols installed) run >> >>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) >> >>on all nodes at the same time? this should minimize the fallout and show >>us whether the thread that logged the first part of sending the state is >>still around on the node that triggers the hang.. ok, no problem, I'll do a new test today >>this looks like you bt-ed corosync this time around? if so, than this is >>expected: >> >>- attach gdb to corosync >>- corosync blocks >>- other nodes notice corosync is gone on node X >>- config change triggered >>- sync restarts on all nodes, does not trigger bug this time ok, thanks, make sense. I didn't notice this previously, but this time it was with corosync started. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Lundi 28 Septembre 2020 11:17:37 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote: > here a new hang: > > http://odisoweb1.odiso.net/test4/ okay, so at least we now know something strange inside pmxcfs is going on, and not inside corosync - we never reach the part where the broken node (again #13 in this hang!) 
sends out the state message: Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources) Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state) [...] Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full) this should be followed by output like this (from the -unblock log, where the sync went through): Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state) Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state) but this never happens. the relevant code from our patched dfsm.c (reordered): static cs_error_t dfsm_send_state_message_full( dfsm_t *dfsm, uint16_t type, struct iovec *iov, unsigned int len) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); // this message is still logged cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len); // everything below this point might not have happened anymore dfsm_message_state_header_t header; header.base.type = type; header.base.subtype = 0; header.base.protocol_version = dfsm->protocol_version; header.base.time = time(NULL); header.base.reserved = 0; header.epoch = dfsm->sync_epoch; struct iovec real_iov[len + 1]; real_iov[0].iov_base = (char *)&header; real_iov[0].iov_len = sizeof(header); for (int i = 0; i < len; i++) real_iov[i + 1] = iov[i]; return type == DFSM_MESSAGE_STATE ? 
dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) : dfsm_send_message_full(dfsm, real_iov, len + 1, 1); } static cs_error_t dfsm_send_message_full_debug_state( dfsm_t *dfsm, struct iovec *iov, unsigned int len, int retry) { g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM); g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM); struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 }; cs_error_t result; int retries = 0; // this message is not visible in log // we don't know how far above this we managed to run cfs_dom_message(dfsm->log_domain, "send state message debug"); cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle); for (int i = 0; i < len; i++) cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len); loop: cfs_dom_message(dfsm->log_domain, "send state message loop body"); result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len); cfs_dom_message(dfsm->log_domain, "send state message result: %d", result); if (retry && result == CS_ERR_TRY_AGAIN) { nanosleep(&tvreq, NULL); ++retries; if ((retries % 10) == 0) cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries); if (retries < 100) goto loop; } if (retries) cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries); if (result != CS_OK && (!retry || result != CS_ERR_TRY_AGAIN)) cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result); return result; } I don't see much that could go wrong inside dfsm_send_state_message_full after the debug log statement (it's just filling out the header and iov structs, and calling 'dfsm_send_message_full_debug_state' since type == 2 == DFSM_MESSAGE_STATE). inside dfsm_send_message_full_debug, before the first log statement we only check for dfsm or the message length/content being NULL/0, all of which can't really happen with that call path. also, in that case we'd return CS_ERR_INVALID_PARAM , which would bubble up into the delivery callback and cause us to leave CPG, which would again be visible in the logs.. but, just to make sure, could you reproduce the issue once more, and then (with debug symbols installed) run $ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs) on all nodes at the same time? this should minimize the fallout and show us whether the thread that logged the first part of sending the state is still around on the node that triggers the hang.. > Something new: where doing coredump or bt-full on pmxcfs on node1, > this have unlocked /etc/pve on other nodes > > /etc/pve unlocked(with coredump or bt-full): 17:57:40 this looks like you bt-ed corosync this time around? if so, than this is expected: - attach gdb to corosync - corosync blocks - other nodes notice corosync is gone on node X - config change triggered - sync restarts on all nodes, does not trigger bug this time _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-28 15:59 ` Alexandre DERUMIER 2020-09-29 5:30 ` Alexandre DERUMIER @ 2020-09-29 8:51 ` Fabian Grünbichler 2020-09-29 9:37 ` Alexandre DERUMIER 1 sibling, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-29 8:51 UTC (permalink / raw) To: Proxmox VE development discussion On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: > Here a new test http://odisoweb1.odiso.net/test5 > > This has occured at corosync start > > > node1: > ----- > start corosync : 17:30:19 > > > node2: /etc/pve locked > -------------- > Current time : 17:30:24 > > > I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 > > and a coredump of all nodes at same time with parallel ssh at 17:42:26 > > > (Note that this time, /etc/pve was still locked after backtrace/coredump) okay, so this time two more log lines got printed on the (again) problem causing node #13, but it still stops logging at a point where this makes no sense. I rebuilt the packages: f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb with a change of how the logging is set up (I now suspect that some messages might get dropped if the logging throughput is high enough), let's hope this gets us the information we need. please repeat the test5 again with these packages. is there anything special about node 13? network topology, slower hardware, ... ? ^ permalink raw reply [flat|nested] 84+ messages in thread
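For illustration of the suspicion above (a generic sketch, explicitly not pmxcfs's actual logging path): an asynchronous, bounded log queue that refuses to block will silently drop entries once the writer outruns the consumer, so missing debug lines would not prove that the instrumented code never ran.

/* generic illustration of lossy async logging, not pmxcfs code */
#include <stdio.h>
#include <string.h>

#define RING_SIZE 4

static char ring[RING_SIZE][128];
static unsigned int head, tail;   /* tail would advance in a consumer thread */

static int log_async(const char *msg)
{
    if (head - tail == RING_SIZE)      /* queue full: drop instead of block */
        return 0;
    snprintf(ring[head % RING_SIZE], sizeof(ring[0]), "%s", msg);
    head++;
    return 1;
}

int main(void)
{
    int dropped = 0;
    for (int i = 0; i < 10; i++)
        if (!log_async("debug: send state message"))
            dropped++;
    printf("%d of 10 messages dropped because the consumer never caught up\n",
           dropped);
    return 0;
}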
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 8:51 ` Fabian Grünbichler @ 2020-09-29 9:37 ` Alexandre DERUMIER 2020-09-29 10:52 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 9:37 UTC (permalink / raw) To: Proxmox VE development discussion >>with a change of how the logging is set up (I now suspect that some >>messages might get dropped if the logging throughput is high enough), >>let's hope this gets us the information we need. please repeat the test5 >>again with these packages. I'll test this afternoon >>is there anything special about node 13? network topology, slower >>hardware, ... ? no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. this node is around 10% cpu usage, load is around 5. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 10:51:32 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: > Here a new test http://odisoweb1.odiso.net/test5 > > This has occured at corosync start > > > node1: > ----- > start corosync : 17:30:19 > > > node2: /etc/pve locked > -------------- > Current time : 17:30:24 > > > I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 > > and a coredump of all nodes at same time with parallel ssh at 17:42:26 > > > (Note that this time, /etc/pve was still locked after backtrace/coredump) okay, so this time two more log lines got printed on the (again) problem causing node #13, but it still stops logging at a point where this makes no sense. I rebuilt the packages: f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb with a change of how the logging is set up (I now suspect that some messages might get dropped if the logging throughput is high enough), let's hope this gets us the information we need. please repeat the test5 again with these packages. is there anything special about node 13? network topology, slower hardware, ... ? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 9:37 ` Alexandre DERUMIER @ 2020-09-29 10:52 ` Alexandre DERUMIER 2020-09-29 11:43 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 10:52 UTC (permalink / raw) To: Proxmox VE development discussion here a new test: http://odisoweb1.odiso.net/test6/ node1 ----- start corosync : 12:08:33 node2 (/etc/pve lock) ----- Current time : 12:08:39 node1 (stop corosync : unlock /etc/pve) ----- 12:28:11 : systemctl stop corosync backtraces: 12:26:30 coredump : 12:27:21 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 11:37:41 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>with a change of how the logging is set up (I now suspect that some >>messages might get dropped if the logging throughput is high enough), >>let's hope this gets us the information we need. please repeat the test5 >>again with these packages. I'll test this afternoon >>is there anything special about node 13? network topology, slower >>hardware, ... ? no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. this node is around 10% cpu usage, load is around 5. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 10:51:32 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: > Here a new test http://odisoweb1.odiso.net/test5 > > This has occured at corosync start > > > node1: > ----- > start corosync : 17:30:19 > > > node2: /etc/pve locked > -------------- > Current time : 17:30:24 > > > I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 > > and a coredump of all nodes at same time with parallel ssh at 17:42:26 > > > (Note that this time, /etc/pve was still locked after backtrace/coredump) okay, so this time two more log lines got printed on the (again) problem causing node #13, but it still stops logging at a point where this makes no sense. I rebuilt the packages: f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb with a change of how the logging is set up (I now suspect that some messages might get dropped if the logging throughput is high enough), let's hope this gets us the information we need. please repeat the test5 again with these packages. is there anything special about node 13? network topology, slower hardware, ... ? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 10:52 ` Alexandre DERUMIER @ 2020-09-29 11:43 ` Alexandre DERUMIER 2020-09-29 11:50 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 11:43 UTC (permalink / raw) To: Proxmox VE development discussion >> >>node1 (stop corosync : unlock /etc/pve) >>----- >>12:28:11 : systemctl stop corosync sorry, this was wrong,I need to start corosync after the stop to get it working again I'll reupload theses logs ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 12:52:44 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown here a new test: http://odisoweb1.odiso.net/test6/ node1 ----- start corosync : 12:08:33 node2 (/etc/pve lock) ----- Current time : 12:08:39 node1 (stop corosync : unlock /etc/pve) ----- 12:28:11 : systemctl stop corosync backtraces: 12:26:30 coredump : 12:27:21 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 11:37:41 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>with a change of how the logging is set up (I now suspect that some >>messages might get dropped if the logging throughput is high enough), >>let's hope this gets us the information we need. please repeat the test5 >>again with these packages. I'll test this afternoon >>is there anything special about node 13? network topology, slower >>hardware, ... ? no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. this node is around 10% cpu usage, load is around 5. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 10:51:32 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: > Here a new test http://odisoweb1.odiso.net/test5 > > This has occured at corosync start > > > node1: > ----- > start corosync : 17:30:19 > > > node2: /etc/pve locked > -------------- > Current time : 17:30:24 > > > I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 > > and a coredump of all nodes at same time with parallel ssh at 17:42:26 > > > (Note that this time, /etc/pve was still locked after backtrace/coredump) okay, so this time two more log lines got printed on the (again) problem causing node #13, but it still stops logging at a point where this makes no sense. I rebuilt the packages: f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb with a change of how the logging is set up (I now suspect that some messages might get dropped if the logging throughput is high enough), let's hope this gets us the information we need. please repeat the test5 again with these packages. is there anything special about node 13? network topology, slower hardware, ... ? 
_______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 11:43 ` Alexandre DERUMIER @ 2020-09-29 11:50 ` Alexandre DERUMIER 2020-09-29 13:28 ` Fabian Grünbichler 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 11:50 UTC (permalink / raw) To: Proxmox VE development discussion I have reuploaded the logs node1 ----- start corosync : 12:08:33 (corosync.log) node2 (/etc/pve lock) ----- Current time : 12:08:39 node1 (stop corosync : ---> not unlocked) (corosync-stop.log) ----- 12:28:11 : systemctl stop corosync node2 (start corosync: ----> /etc/pve unlocked (corosync-start.log) ------------------------------------------------ 13:41:16 : systemctl start corosync ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 13:43:08 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >> >>node1 (stop corosync : unlock /etc/pve) >>----- >>12:28:11 : systemctl stop corosync sorry, this was wrong,I need to start corosync after the stop to get it working again I'll reupload theses logs ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 12:52:44 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown here a new test: http://odisoweb1.odiso.net/test6/ node1 ----- start corosync : 12:08:33 node2 (/etc/pve lock) ----- Current time : 12:08:39 node1 (stop corosync : unlock /etc/pve) ----- 12:28:11 : systemctl stop corosync backtraces: 12:26:30 coredump : 12:27:21 ----- Mail original ----- De: "aderumier" <aderumier@odiso.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 11:37:41 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown >>with a change of how the logging is set up (I now suspect that some >>messages might get dropped if the logging throughput is high enough), >>let's hope this gets us the information we need. please repeat the test5 >>again with these packages. I'll test this afternoon >>is there anything special about node 13? network topology, slower >>hardware, ... ? no nothing special, all nodes have exactly same hardware/cpu (24cores/48threads 3ghz)/memory/disk. this node is around 10% cpu usage, load is around 5. ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 10:51:32 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: > Here a new test http://odisoweb1.odiso.net/test5 > > This has occured at corosync start > > > node1: > ----- > start corosync : 17:30:19 > > > node2: /etc/pve locked > -------------- > Current time : 17:30:24 > > > I have done backtrace of all nodes at same time with parallel ssh at 17:35:22 > > and a coredump of all nodes at same time with parallel ssh at 17:42:26 > > > (Note that this time, /etc/pve was still locked after backtrace/coredump) okay, so this time two more log lines got printed on the (again) problem causing node #13, but it still stops logging at a point where this makes no sense. 
I rebuilt the packages: f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb with a change of how the logging is set up (I now suspect that some messages might get dropped if the logging throughput is high enough), let's hope this gets us the information we need. please repeat the test5 again with these packages. is there anything special about node 13? network topology, slower hardware, ... ? _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 11:50 ` Alexandre DERUMIER @ 2020-09-29 13:28 ` Fabian Grünbichler 2020-09-29 13:52 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Fabian Grünbichler @ 2020-09-29 13:28 UTC (permalink / raw) To: Proxmox VE development discussion huge thanks for all the work on this btw! I think I've found a likely culprit (a missing lock around a non-thread-safe corosync library call) based on the last logs (which were now finally complete!). rebuilt packages with a proof-of-concept-fix: 23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb 9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb I removed some logging statements which are no longer needed, so output is a bit less verbose again. if you are not able to trigger the issue with this package, feel free to remove the -debug and let it run for a little longer without the massive logs. if feedback from your end is positive, I'll whip up a proper patch tomorrow or on Thursday. ^ permalink raw reply [flat|nested] 84+ messages in thread
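A minimal sketch of the kind of fix described above, assuming the problem really is two pmxcfs threads entering libcpg on the same handle at once; the wrapper and mutex names are made up for illustration, this is not the actual pve-cluster patch:

/* sketch: serialize sends on a shared cpg handle with a mutex */
#include <sys/uio.h>
#include <glib.h>
#include <corosync/cpg.h>

static GMutex cpg_send_lock;    /* one lock per dfsm/cpg handle */

static cs_error_t
dfsm_cpg_send_locked(cpg_handle_t handle, struct iovec *iov, unsigned int len)
{
    cs_error_t result;

    g_mutex_lock(&cpg_send_lock);
    result = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, len);
    g_mutex_unlock(&cpg_send_lock);

    return result;
}

Keeping the CS_ERR_TRY_AGAIN retry loop outside the locked region (re-taking the lock per attempt) would avoid holding the mutex across the nanosleep back-off while the ring is congested.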
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 13:28 ` Fabian Grünbichler @ 2020-09-29 13:52 ` Alexandre DERUMIER 2020-09-30 6:09 ` Alexandre DERUMIER 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-29 13:52 UTC (permalink / raw) To: Proxmox VE development discussion >>huge thanks for all the work on this btw! huge thanks to you ! ;) >>I think I've found a likely culprit (a missing lock around a >>non-thread-safe corosync library call) based on the last logs (which >>were now finally complete!). YES :) >>if feedback from your end is positive, I'll whip up a proper patch >>tomorrow or on Thursday. I'm going to launch a new test right now ! ----- Mail original ----- De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com> À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> Envoyé: Mardi 29 Septembre 2020 15:28:19 Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown huge thanks for all the work on this btw! I think I've found a likely culprit (a missing lock around a non-thread-safe corosync library call) based on the last logs (which were now finally complete!). rebuilt packages with a proof-of-concept-fix: 23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb 9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb I removed some logging statements which are no longer needed, so output is a bit less verbose again. if you are not able to trigger the issue with this package, feel free to remove the -debug and let it run for a little longer without the massive logs. if feedback from your end is positive, I'll whip up a proper patch tomorrow or on Thursday. _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-29 13:52 ` Alexandre DERUMIER @ 2020-09-30 6:09 ` Alexandre DERUMIER 2020-09-30 6:26 ` Thomas Lamprecht 0 siblings, 1 reply; 84+ messages in thread From: Alexandre DERUMIER @ 2020-09-30 6:09 UTC (permalink / raw) To: Proxmox VE development discussion

Hi,

some news: my last test has been running for 14h now, and I haven't had any problem :)

So it seems that it is indeed fixed! Congratulations!

I wonder if it could be related to this forum user
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/

His problem is that after a corosync lag (he has 1 cluster stretched over 2 DCs with 10km distance, so I think he sometimes gets some small lag), 1 node floods the other nodes with a lot of UDP packets (and makes things worse, as the corosync CPU goes to 100% / overloaded and then can't see the other nodes).

I had this problem 6 months ago after shutting down a node, that's why I'm thinking it could "maybe" be related.

So I wonder if it could be the same pmxcfs bug, where something loops or keeps re-sending packets.

The forum user seems to have the problem multiple times in some weeks, so maybe he'll be able to test the new fixed pmxcfs, and tell us if it fixes this bug too.

----- Mail original -----
De: "aderumier" <aderumier@odiso.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 15:52:18
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>huge thanks for all the work on this btw!
huge thanks to you ! ;)

>>I think I've found a likely culprit (a missing lock around a
>>non-thread-safe corosync library call) based on the last logs (which
>>were now finally complete!).
YES :)

>>if feedback from your end is positive, I'll whip up a proper patch
>>tomorrow or on Thursday.
I'm going to launch a new test right now !

----- Mail original -----
De: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
À: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Envoyé: Mardi 29 Septembre 2020 15:28:19
Objet: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

huge thanks for all the work on this btw!

I think I've found a likely culprit (a missing lock around a non-thread-safe corosync library call) based on the last logs (which were now finally complete!).

rebuilt packages with a proof-of-concept-fix:

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb

I removed some logging statements which are no longer needed, so output is a bit less verbose again.

if you are not able to trigger the issue with this package, feel free to remove the -debug and let it run for a little longer without the massive logs.

if feedback from your end is positive, I'll whip up a proper patch tomorrow or on Thursday.

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-30 6:09 ` Alexandre DERUMIER @ 2020-09-30 6:26 ` Thomas Lamprecht 0 siblings, 0 replies; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-30 6:26 UTC (permalink / raw) To: Proxmox VE development discussion, Alexandre DERUMIER

Hi,

On 30.09.20 08:09, Alexandre DERUMIER wrote:
> some news, my last test is running for 14h now, and I don't have had any problem :)
>

great! Thanks for all your testing time, this would have been much harder, if even possible at all, without you providing so much testing effort on a production(!) cluster - appreciated!

Naturally many thanks to Fabian too, for reading so many logs without going insane :-)

> So, it seem that is indeed fixed ! Congratulations !
>

honza confirmed Fabian's suspicion about the lacking guarantees of thread safety for cpg_mcast_joined, which was sadly not documented, so this is surely a bug; let's hope it's the last of such hard-to-reproduce ones.

> I wonder if it could be related to this forum user
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/
>
> His problem is that after corosync lag (he's have 1 cluster stretch on 2DC with 10km distance, so I think sometimes he's having some small lag,
> 1 node is flooding other nodes with a lot of udp packets. (and making things worst, as corosync cpu is going to 100% / overloaded, and then can't see other onodes

I can imagine this problem showing up as a side effect of a flood where partition changes happen. Not so sure that this can be the cause of that directly.

> I had this problem 6month ago after shutting down a node, that's why I'm thinking it could "maybe" related.
>
> So, I wonder if it could be same pmxcfs bug, when something looping or send again again packets.
>
> The forum user seem to have the problem multiple times in some week, so maybe he'll be able to test the new fixed pmxcs, and tell us if it's fixing this bug too.

Testing once available would surely be a good idea for them.

^ permalink raw reply [flat|nested] 84+ messages in thread
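For reference, the pattern behind the fix discussed above - serializing calls into a library function that is not thread-safe - looks roughly like the following. This is only a sketch using the public libcpg API with hypothetical wrapper and mutex names; it is not the actual pve-cluster patch.

    #include <pthread.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    /* Illustration only: since cpg_mcast_joined() turned out not to be
     * thread-safe, concurrent senders must be serialized by the caller.
     * Wrapper name and mutex are hypothetical, not the pmxcfs code. */
    static pthread_mutex_t cpg_send_lock = PTHREAD_MUTEX_INITIALIZER;

    static cs_error_t locked_cpg_mcast_joined(cpg_handle_t handle,
                                              cpg_guarantee_t guarantee,
                                              const struct iovec *iov,
                                              unsigned int iov_len)
    {
        cs_error_t res;

        pthread_mutex_lock(&cpg_send_lock);
        res = cpg_mcast_joined(handle, guarantee, iov, iov_len);
        pthread_mutex_unlock(&cpg_send_lock);
        return res;
    }

The actual proof-of-concept packages presumably place such a lock around the affected call inside pmxcfs; the sketch only shows the general shape of the change.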
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-15 6:27 ` Alexandre DERUMIER 2020-09-15 7:13 ` dietmar @ 2020-09-15 7:58 ` Thomas Lamprecht 1 sibling, 0 replies; 84+ messages in thread From: Thomas Lamprecht @ 2020-09-15 7:58 UTC (permalink / raw) To: Alexandre DERUMIER, dietmar; +Cc: Proxmox VE development discussion

On 9/15/20 8:27 AM, Alexandre DERUMIER wrote:
>>> This is by intention - we do not want to stop pmxcfs only because coorosync service stops.
>
> Yes, but at shutdown, it could be great to stop pmxcfs before corosync ?
> I ask the question, because the 2 times I have problem, it was when shutting down a server.
> So maybe some strange behaviour occur with both corosync && pmxcfs are stopped at same time ?
>
>
> looking at the pve-cluster unit file,
> why do we have "Before=corosync.service" and not "After=corosync.service" ?

We may need to sync over the cluster corosync.conf to the local one, and that can only happen before. Also, if we shut down pmxcfs before corosync we may still get corosync events (file writes, locking, ...) but the node would not see them locally anymore while still looking quorate to others - that would not be good.

>
> I have tried to change this, but even with that, both are still shutting down in parallel.
>
> the only way I have found to have clean shutdown, is "Requires=corosync.server" + "After=corosync.service".
> But that mean than if you restart corosync, it's restart pmxcfs too first.
>
> I have looked at systemd doc, After= should be enough (as at shutdown it's doing the reverse order),
> but I don't known why corosync don't wait than pve-cluster ???
>
>
> (Also, I think than pmxcfs is also stopping after syslog, because I never see the pmxcfs "teardown filesystem" logs at shutdown)

is that true for (persistent) systemd-journald too? IIRC syslog.target is deprecated and only rsyslog provides it.

As the next Debian will enable persistent journal by default and we already use it for everything (IIRC) where we provide an interface to logs, we will probably not enable rsyslog by default with PVE 7.x. But if we can add some ordering for this to be improved, I'm open to it.

^ permalink raw reply [flat|nested] 84+ messages in thread
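The Before=/After=/Requires= semantics discussed above can be summarised with a hypothetical drop-in. This is an illustration only - not the shipped pve-cluster.service and not a recommendation, since, as explained above, the unit intentionally orders itself Before=corosync.service so it can sync corosync.conf first.

    # /etc/systemd/system/pve-cluster.service.d/ordering.conf  (hypothetical)
    [Unit]
    # Ordering only: start pve-cluster after corosync; because systemd
    # reverses ordering for the shutdown transaction, pve-cluster would
    # then be stopped before corosync.
    After=corosync.service
    # A requirement dependency on top of the ordering ties the units
    # together, so restarting corosync would also restart pve-cluster -
    # the drawback mentioned in the quoted mail:
    #Requires=corosync.service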
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 2020-09-04 12:29 ` Alexandre DERUMIER 2020-09-04 15:42 ` Dietmar Maurer @ 2020-12-29 14:21 ` Josef Johansson 1 sibling, 0 replies; 84+ messages in thread From: Josef Johansson @ 2020-12-29 14:21 UTC (permalink / raw) To: pve-devel Hi, On 9/4/20 2:29 PM, Alexandre DERUMIER wrote: > BTW, > > do you think it could be possible to add an extra optionnal layer of security check, not related to corosync ? > > I'm still afraid of this corosync bug since years, and still don't use HA. (or I have tried to enable it 2months ago,and this give me a disaster yesterday..) Using corosync to enable HA is a bit scary TBH, not sure how people solve this, but I'd rather have the logic outside the cluster so it can make a solid decision whether a node should be rebooted or not, and choose to reboot via iDRAC if all checks fail. Maybe use the metrics support to facilitate this? > Something like an extra heartbeat between nodes daemons, and check if we also have quorum with theses heartbeats ? > > > > ----- Mail original ----- > De: "aderumier" <aderumier@odiso.com> > À: "pve-devel" <pve-devel@pve.proxmox.com> > Envoyé: Jeudi 3 Septembre 2020 16:11:56 > Objet: [pve-devel] corosync bug: cluster break after 1 node clean shutdown > > Hi, > > I had a problem this morning with corosync, after shutting down cleanly a node from the cluster. > This is a 14 nodes cluster, node2 was shutted. (with "halt" command) > > HA was actived, and all nodes have been rebooted. > (I have already see this problem some months ago, without HA enabled, and it was like half of the nodes didn't see others nodes, like 2 cluster formations) > > Some users have reported similar problems when adding a new node > https://forum.proxmox.com/threads/quorum-lost-when-adding-new-7th-node.75197/ > > > Here the full logs of the nodes, for each node I'm seeing quorum with 13 nodes. (so it's seem to be ok). > I didn't have time to look live on server, as HA have restarted them. > So, I don't known if it's a corosync bug or something related to crm,lrm,pmxcs (but I don't see any special logs) > > Only node7 have survived a little bit longer,and I had stop lrm/crm to avoid reboot. > > libknet1: 1.15-pve1 > corosync: 3.0.3-pve1 > > > Any ideas before submiting bug to corosync mailing ? (I'm seeing new libknet && corosync version on their github, I don't have read the changelog yet) > > > > > node1 > ----- > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm1 pmxcfs[3527]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm1 corosync[3678]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm1 corosync[3678]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm1 corosync[3678]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [dcdb] notice: cpg_send_message retried 1 times > Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: received all states > Sep 3 10:39:17 m6kvm1 pmxcfs[3527]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm1 corosync[3678]: [KNET ] host: host: 2 has no active links > > --> reboot > > node2 : shutdown log > ----- > Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd received signal 15: Terminated > Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302Y9120CGN.ata.state > Sep 3 10:39:05 m6kvm2 smartd[27390]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.INTEL_SSDSC2BB120G6R-PHWA640302MG120CGN.ata.state > Sep 3 10:39:05 m6kvm2 smartd[27390]: smartd is exiting (exit status 0) > Sep 3 10:39:05 m6kvm2 nrpe[20095]: Caught SIGTERM - shutting down... > Sep 3 10:39:05 m6kvm2 nrpe[20095]: Daemon shutdown > Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> starting task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: > Sep 3 10:39:07 m6kvm2 pve-guests[38552]: all VMs and CTs stopped > Sep 3 10:39:07 m6kvm2 pve-guests[38550]: <root@pam> end task UPID:m6kvm2:00009698:93A98FFD:5F50ABAB:stopall::root@pam: OK > Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: received signal TERM > Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server closing > Sep 3 10:39:07 m6kvm2 spiceproxy[36572]: worker exit > Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: worker 36572 finished > Sep 3 10:39:07 m6kvm2 spiceproxy[3847]: server stopped > Sep 3 10:39:07 m6kvm2 pvestatd[3786]: received signal TERM > Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server closing > Sep 3 10:39:07 m6kvm2 pvestatd[3786]: server stopped > Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: received signal TERM > Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: got shutdown request with shutdown policy 'conditional' > Sep 3 10:39:07 m6kvm2 pve-ha-lrm[12768]: shutdown LRM, stop all services > Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: received terminate request (signal) > Sep 3 10:39:08 m6kvm2 pvefw-logger[36670]: stopping pvefw logger > Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: watchdog closed (disabled) > Sep 3 10:39:08 m6kvm2 pve-ha-lrm[12768]: server stopped > Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: received signal TERM > Sep 3 10:39:10 m6kvm2 pve-ha-crm[12847]: server received shutdown request > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: received signal TERM > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server closing > Sep 3 10:39:10 m6kvm2 pveproxy[29731]: worker exit > Sep 3 10:39:10 m6kvm2 pveproxy[30873]: worker exit > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 30873 finished > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 31391 finished > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: worker 29731 finished > Sep 3 10:39:10 m6kvm2 pveproxy[24735]: server stopped > Sep 3 10:39:14 m6kvm2 pve-ha-crm[12847]: server stopped > > > node3 > ----- > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: 
[dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm3 corosync[30580]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm3 corosync[30580]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm3 corosync[30580]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [dcdb] notice: cpg_send_message retried 1 times > Sep 3 10:39:16 m6kvm3 pmxcfs[27373]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: received all states > Sep 3 10:39:17 m6kvm3 pmxcfs[27373]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm3 corosync[30580]: [KNET ] host: host: 2 has no active links > > > node4 > ----- > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm4 pmxcfs[3903]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm4 corosync[4085]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm4 corosync[4085]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm4 corosync[4085]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: received all states > Sep 3 10:39:17 m6kvm4 pmxcfs[3903]: [status] notice: all data is up to date > Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:21 m6kvm4 corosync[4085]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:19 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:20 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:21 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > Sep 3 10:40:22 m6kvm4 pmxcfs[3903]: [status] notice: received log > > node5 > ----- > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm5 pmxcfs[42272]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm5 corosync[41830]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm5 corosync[41830]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm5 corosync[41830]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: received all states > Sep 3 10:39:17 m6kvm5 pmxcfs[42272]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm5 corosync[41830]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:19 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:20 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 
pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:21 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > Sep 3 10:40:22 m6kvm5 pmxcfs[42272]: [status] notice: received log > > > > > > node6 > ----- > Sep 3 10:39:16 m6kvm6 corosync[36694]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm6 corosync[36694]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm6 corosync[36694]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm6 corosync[36694]: [KNET ] host: host: 2 has no active links > > > > node7 > ----- > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm7 corosync[15467]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: received all states > Sep 3 10:39:17 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date > Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:19 m6kvm7 corosync[15467]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:19 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:20 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:21 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 
pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:40:22 m6kvm7 pmxcfs[15892]: [status] notice: received log > ---> here the others nodes reboot almost at the same time > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 3 link: 0 is down > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] link: host: 14 link: 0 is down > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 3 has no active links > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 (passive) best link: 0 (pri: 1) > Sep 3 10:40:25 m6kvm7 corosync[15467]: [KNET ] host: host: 14 has no active links > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 13 link: 0 is down > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 8 link: 0 is down > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 6 link: 0 is down > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] link: host: 10 link: 0 is down > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1) > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 13 has no active links > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1) > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 8 has no active links > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1) > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 6 has no active links > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1) > Sep 3 10:40:27 m6kvm7 corosync[15467]: [KNET ] host: host: 10 has no active links > Sep 3 10:40:28 m6kvm7 corosync[15467]: [TOTEM ] Token has not been received in 4505 ms > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 11 link: 0 is down > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] link: host: 9 link: 0 is down > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1) > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 11 has no active links > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1) > Sep 3 10:40:29 m6kvm7 corosync[15467]: [KNET ] host: host: 9 has no active links > Sep 3 10:40:30 m6kvm7 corosync[15467]: [TOTEM ] A processor failed, forming new configuration. 
> Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 4 link: 0 is down > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 12 link: 0 is down > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] link: host: 1 link: 0 is down > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 4 has no active links > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 (passive) best link: 0 (pri: 1) > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 12 has no active links > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) > Sep 3 10:40:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 has no active links > Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] link: host: 5 link: 0 is down > Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1) > Sep 3 10:40:34 m6kvm7 corosync[15467]: [KNET ] host: host: 5 has no active links > Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] A new membership (7.82f) was formed. Members left: 1 3 4 5 6 8 9 10 11 12 13 14 > Sep 3 10:40:41 m6kvm7 corosync[15467]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 5 6 8 9 10 11 12 13 14 > Sep 3 10:40:41 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 12 received > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 > Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] This node is within the non-primary component and will NOT provide any services. > Sep 3 10:40:41 m6kvm7 corosync[15467]: [QUORUM] Members[1]: 7 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date > Sep 3 10:40:41 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: dfsm_deliver_queue: queue length 51 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 1/3527 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 4/3903 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 9/22810 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 6/37248 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 11/12983 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 10/41940 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 5/42272 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 13/39678 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 8/24790 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 > Sep 3 10:40:41 
m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 3/27373 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 12/44214 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] notice: remove message from non-member 14/42930 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: node lost quorum > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [status] notice: members: 7/15892 > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: received write while not quorate - trigger resync > Sep 3 10:40:41 m6kvm7 pmxcfs[15892]: [dcdb] crit: leaving CPG group > Sep 3 10:40:41 m6kvm7 pve-ha-crm[16196]: node 'm6kvm2': state changed from 'online' => 'unknown' > Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] notice: start cluster connection > Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: cpg_join failed: 14 > Sep 3 10:40:42 m6kvm7 pmxcfs[15892]: [dcdb] crit: can't initialize service > Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 7/15892 > Sep 3 10:40:48 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date > Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum! > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied > Sep 3 10:40:56 m6kvm7 pve-ha-lrm[16140]: status change active => lost_agent_lock > Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change master => lost_manager_lock > Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: watchdog closed (disabled) > Sep 3 10:40:56 m6kvm7 pve-ha-crm[16196]: status change lost_manager_lock => wait_for_quorum > Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] rx: host: 4 link: 0 is up > Sep 3 10:43:23 m6kvm7 corosync[15467]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1) > Sep 3 10:43:24 m6kvm7 corosync[15467]: [TOTEM ] A new membership (4.834) was formed. 
Members joined: 4 > Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:43:24 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 4/3965, 7/15892 > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: members: 4/3965, 7/15892 > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation > Sep 3 10:43:24 m6kvm7 corosync[15467]: [QUORUM] Members[2]: 4 7 > Sep 3 10:43:24 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 4/3965/00000002) > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 4/3965/00000002) > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 7/15892 > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 7/15892 > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: start sending inode updates > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: sent all (4) updates > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: received all states > Sep 3 10:43:24 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date > Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] rx: host: 3 link: 0 is up > Sep 3 10:44:05 m6kvm7 corosync[15467]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) > Sep 3 10:44:05 m6kvm7 corosync[15467]: [TOTEM ] A new membership (3.838) was formed. Members joined: 3 > Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:05 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 3/3613, 4/3965, 7/15892 > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: members: 3/3613, 4/3965, 7/15892 > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation > Sep 3 10:44:05 m6kvm7 corosync[15467]: [QUORUM] Members[3]: 3 4 7 > Sep 3 10:44:05 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 3/3613/00000002) > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 3/3613/00000002) > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 4/3965 > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 4/3965, 7/15892 > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: received all states > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date > Sep 3 10:44:05 m6kvm7 pmxcfs[15892]: [status] notice: dfsm_deliver_queue: queue length 1 > Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] rx: host: 1 link: 0 is up > Sep 3 10:44:31 m6kvm7 corosync[15467]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) > Sep 3 10:44:31 m6kvm7 corosync[15467]: [TOTEM ] A new membership (1.83c) was formed. 
Members joined: 1 > Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:31 m6kvm7 corosync[15467]: [CPG ] downlist left_list: 0 received > Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 > Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [dcdb] notice: starting data syncronisation > Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: members: 1/3552, 3/3613, 4/3965, 7/15892 > Sep 3 10:44:31 m6kvm7 pmxcfs[15892]: [status] notice: starting data syncronisation > Sep 3 10:44:31 m6kvm7 corosync[15467]: [QUORUM] Members[4]: 1 3 4 7 > Sep 3 10:44:31 m6kvm7 corosync[15467]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received sync request (epoch 1/3552/00000002) > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received sync request (epoch 1/3552/00000002) > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: received all states > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: leader is 3/3613 > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: synced members: 3/3613, 4/3965, 7/15892 > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [dcdb] notice: all data is up to date > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: received all states > Sep 3 10:44:32 m6kvm7 pmxcfs[15892]: [status] notice: all data is up to date > Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:46:33 m6kvm7 pmxcfs[15892]: [status] notice: received log > Sep 3 10:49:18 m6kvm7 pmxcfs[15892]: [status] notice: received log > > > > > node8 > ----- > > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm8 pmxcfs[24790]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm8 corosync[24361]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm8 corosync[24361]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm8 corosync[24361]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: received all states > Sep 3 10:39:17 m6kvm8 pmxcfs[24790]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm8 corosync[24361]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log > Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log > Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log > Sep 3 10:40:19 m6kvm8 pmxcfs[24790]: [status] notice: received log > > > --> reboot > > > > node9 > ----- > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm9 pmxcfs[22810]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm9 corosync[22340]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm9 corosync[22340]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm9 corosync[22340]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: received all states > Sep 3 10:39:17 m6kvm9 pmxcfs[22810]: [status] notice: all data is up to date > Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:21 m6kvm9 corosync[22340]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:19 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:20 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:21 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 
pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > Sep 3 10:40:22 m6kvm9 pmxcfs[22810]: [status] notice: received log > > > --> reboot > > > > node10 > ------ > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm10 pmxcfs[41940]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm10 corosync[41458]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm10 corosync[41458]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm10 corosync[41458]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: received all states > Sep 3 10:39:17 m6kvm10 pmxcfs[41940]: [status] notice: all data is up to date > Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:21 m6kvm10 corosync[41458]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log > Sep 3 10:40:19 m6kvm10 pmxcfs[41940]: [status] notice: received log > > --> reboot > > > node11 > ------ > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm11 pmxcfs[12983]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm11 corosync[12455]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm11 corosync[12455]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm11 corosync[12455]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: received all states > Sep 3 10:39:17 m6kvm11 pmxcfs[12983]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm11 corosync[12455]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log > Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log > Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log > Sep 3 10:40:19 m6kvm11 pmxcfs[12983]: [status] notice: received log > Sep 3 10:40:20 m6kvm11 pmxcfs[12983]: [status] notice: received log > > > > > node12 > ------ > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm12 pmxcfs[44214]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm12 corosync[43716]: [TOTEM ] A new membership (1.82b) was formed. Members left: 2 > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm12 corosync[43716]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm12 corosync[43716]: [MAIN ] Completed service synchronization, ready to provide service. 
> Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [dcdb] notice: cpg_send_message retried 1 times > Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: received all states > Sep 3 10:39:17 m6kvm12 pmxcfs[44214]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm12 corosync[43716]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:19 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:20 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:21 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > Sep 3 10:40:22 m6kvm12 pmxcfs[44214]: [status] notice: received log > > > > > > node13 > ------ > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm13 corosync[39182]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm13 corosync[39182]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm13 corosync[39182]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:16 m6kvm13 pmxcfs[39678]: [dcdb] notice: cpg_send_message retried 1 times > Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: received all states > Sep 3 10:39:17 m6kvm13 pmxcfs[39678]: [status] notice: all data is up to date > Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:19 m6kvm13 corosync[39182]: [KNET ] host: host: 2 has no active links > Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log > Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log > Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log > Sep 3 10:40:19 m6kvm13 pmxcfs[39678]: [status] notice: received log > --> reboot > > node14 > ------ > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: members: 1/3527, 3/27373, 4/3903, 5/42272, 6/37248, 7/15892, 8/24790, 9/22810, 10/41940, 11/12983, 12/44214, 13/39678, 14/42930 > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: starting data syncronisation > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [dcdb] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm14 pmxcfs[42930]: [status] notice: received sync request (epoch 1/3527/00000002) > Sep 3 10:39:16 m6kvm14 corosync[42413]: [TOTEM ] A new membership (1.82b) was formed. 
Members left: 2 > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [CPG ] downlist left_list: 1 received > Sep 3 10:39:16 m6kvm14 corosync[42413]: [QUORUM] Members[13]: 1 3 4 5 6 7 8 9 10 11 12 13 14 > Sep 3 10:39:16 m6kvm14 corosync[42413]: [MAIN ] Completed service synchronization, ready to provide service. > Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: received all states > Sep 3 10:39:17 m6kvm14 pmxcfs[42930]: [status] notice: all data is up to date > Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] link: host: 2 link: 0 is down > Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1) > Sep 3 10:39:20 m6kvm14 corosync[42413]: [KNET ] host: host: 2 has no active links > > --> reboot > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel -- Med vänliga hälsningar Josef Johansson ^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
  2020-09-04 12:29 ` Alexandre DERUMIER
@ 2020-09-04 15:46 ` Alexandre DERUMIER
  2020-09-30 15:50 ` Thomas Lamprecht
  2 siblings, 0 replies; 84+ messages in thread
From: Alexandre DERUMIER @ 2020-09-04 15:46 UTC (permalink / raw)
To: pve-devel

What I'm not sure about is which libknet version was actually running. The package was upgraded everywhere to 1.16, but I have noticed that the corosync process is not restarted when libknet is upgraded, so it's quite possible that corosync was still running libknet 1.13. (I need to find a way to determine when corosync, or the node, was last restarted.)

I think we should force a corosync restart on libknet upgrades (or maybe bump the corosync package version at the same time).

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-03 14:11 [pve-devel] corosync bug: cluster break after 1 node clean shutdown Alexandre DERUMIER
  2020-09-04 12:29 ` Alexandre DERUMIER
  2020-09-04 15:46 ` Alexandre DERUMIER
@ 2020-09-30 15:50 ` Thomas Lamprecht
  2020-10-15  9:16 ` Eneko Lacunza
  2 siblings, 1 reply; 84+ messages in thread
From: Thomas Lamprecht @ 2020-09-30 15:50 UTC (permalink / raw)
To: Proxmox VE development discussion, Alexandre DERUMIER

Hi,

FYI: pve-cluster 6.2-1 is now available on pvetest; it includes the slightly modified patch from Fabian.

cheers,
Thomas

^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
  2020-09-30 15:50 ` Thomas Lamprecht
@ 2020-10-15  9:16 ` Eneko Lacunza
  0 siblings, 0 replies; 84+ messages in thread
From: Eneko Lacunza @ 2020-10-15 9:16 UTC (permalink / raw)
To: pve-devel

Hi all,

I'm just a lurker on this list, but I wanted to send a big THANK YOU to everyone involved in this fix, even with this 15-day lag. :-)

I think some of our clients have been affected by this on production clusters, so I hope this will improve their cluster stability.

Normally new features are the prettiest things, but having a robust and dependable system - that has no price!!

Thanks a lot!!
Eneko

On 30/9/20 at 17:50, Thomas Lamprecht wrote:
> Hi,
>
> FYI: pve-cluster 6.2-1 is now available on pvetest, it includes the slightly modified
> patch from Fabian.
>
> cheers,
> Thomas

--
Eneko Lacunza | +34 943 569 206 | elacunza@binovo.es
Technical director (Zuzendari teknikoa) | https://www.binovo.es
Technical director (Director técnico) | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | office 10-11, 20180 Oiartzun

^ permalink raw reply [flat|nested] 84+ messages in thread