From: Alexandre DERUMIER <aderumier@odiso.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Date: Fri, 25 Sep 2020 09:15:49 +0200 (CEST)
Message-ID: <1264529857.1248647.1601018149719.JavaMail.zimbra@odiso.com>
In-Reply-To: <2019315449.1247737.1601016264202.JavaMail.zimbra@odiso.com>
Another hang, this time on corosync stop; the coredump is available here:
http://odisoweb1.odiso.net/test3/
node1
-----
stop corosync: 09:03:10

node2 (/etc/pve locked)
-----------------------
current time: 09:03:10
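
(by "/etc/pve locked" I mean that any access to it hangs; a quick way to watch for this is a loop like the one below -- just an illustration of the check, not necessarily the exact command I used:

while true; do date; timeout 5 ls /etc/pve >/dev/null || echo "/etc/pve blocked"; sleep 1; done
)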
----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Friday, September 25, 2020 08:44:24
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Another test this morning, with the coredump available:
http://odisoweb1.odiso.net/test2/
Something different this time: it happened on corosync start.
node1 (corosync start)
----------------------
start corosync: 08:06:56

node2 (/etc/pve locked)
-----------------------
current time: 08:07:01
I got some warnings while generating the coredump:
(gdb) generate-core-file
warning: target file /proc/35248/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.35248
I hope the corefile is still usable despite the warnings.
I'll do another test this morning.
----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, September 24, 2020 20:07:43
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
I was able to reproduce the issue:
stop corosync on node1: 18:12:29
/etc/pve locked at 18:12:30
logs of all nodes are here:
http://odisoweb1.odiso.net/test1/
I don't have a coredump, as my coworker restarted pmxcfs too fast :/ Sorry.
I'm going to launch another test, with a coredump this time.
----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, September 24, 2020 16:29:17
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Hi Fabian,
>>if you are still able to test, it would be great if you could give the
>>following packages a spin (they only contain some extra debug prints
>>on message processing/sending):
Sure, no problem, I'm going to test it tonight.
>>ideally, you could get the debug logs from all nodes, and the
>>coredump/bt from the node where pmxcfs hangs. thanks!
OK, no problem.
I'll keep you posted tomorrow.
----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Thursday, September 24, 2020 16:02:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On September 22, 2020 7:43 am, Alexandre DERUMIER wrote:
> I have done a test with "kill -9 <pid of corosync>", and I see around a 20s hang on the other nodes,
> but after that everything becomes available again.
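> (roughly, the reproduction sketch: on node1 run "kill -9 $(pidof corosync)", then on another node check
> whether access hangs and for how long, e.g. "time ls /etc/pve".)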
>
>
> So it's really something that happens while corosync is in its shutdown phase and pmxcfs is still running.
>
> So, for now, as a workaround, I have changed
>
> /lib/systemd/system/pve-cluster.service
>
> #Wants=corosync.service
> #Before=corosync.service
> Requires=corosync.service
> After=corosync.service
>
>
> This way, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first as well.
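>
> (just a verification sketch, to double-check that systemd really picked up the new Requires=/After= ordering after editing the unit:
>
> systemctl daemon-reload
> systemctl show pve-cluster.service -p Requires -p After -p Before
> )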
if you are still able to test, it would be great if you could give the
following packages a spin (they only contain some extra debug prints
on message processing/sending):
http://download.proxmox.com/temp/pmxcfs-dbg/
64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb
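to fetch and verify them, something along these lines should work (assuming the .debs sit directly in that directory):

wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster_6.1-8_amd64.deb
wget http://download.proxmox.com/temp/pmxcfs-dbg/pve-cluster-dbgsym_6.1-8_amd64.deb
sha512sum pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb   # compare against the sums above
apt install ./pve-cluster_6.1-8_amd64.deb ./pve-cluster-dbgsym_6.1-8_amd64.deb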
ideally, you could get the debug logs from all nodes, and the
coredump/bt from the node where pmxcfs hangs. thanks!
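for the coredump/bt, attaching gdb to the hung pmxcfs should be enough, roughly:

gdb -p $(pidof pmxcfs)
(gdb) thread apply all bt
(gdb) generate-core-file
(gdb) detach
(gdb) quit

(with the dbgsym package installed, the backtrace should have full symbols)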
diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..e0bd93f 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -162,8 +162,8 @@ static void
dfsm_send_sync_message_abort(dfsm_t *dfsm)
{
g_return_if_fail(dfsm != NULL);
-
g_mutex_lock (&dfsm->sync_mutex);
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64" / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount);
dfsm->msgcount_rcvd = dfsm->msgcount;
g_cond_broadcast (&dfsm->sync_cond);
g_mutex_unlock (&dfsm->sync_mutex);
@@ -181,6 +181,7 @@ dfsm_record_local_result(
g_mutex_lock (&dfsm->sync_mutex);
dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count);
+ cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64": %d", msg_count, msg_result);
if (rp) {
rp->result = msg_result;
rp->processed = processed;
@@ -235,6 +236,8 @@ dfsm_send_state_message_full(
g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);
+
dfsm_message_state_header_t header;
header.base.type = type;
header.base.subtype = 0;
@@ -317,6 +320,7 @@ dfsm_send_message_sync(
for (int i = 0; i < len; i++)
real_iov[i + 1] = iov[i];
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_messag_sync: type NORMAL, msgtype %d, len %d", msgtype, len);
cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
g_mutex_unlock (&dfsm->sync_mutex);
@@ -335,10 +339,12 @@ dfsm_send_message_sync(
if (rp) {
g_mutex_lock (&dfsm->sync_mutex);
- while (dfsm->msgcount_rcvd < msgcount)
+ while (dfsm->msgcount_rcvd < msgcount) {
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount);
g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);
+ }
+ cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!");
-
g_hash_table_remove(dfsm->results, &rp->msgcount);
g_mutex_unlock (&dfsm->sync_mutex);
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback(
return;
}
+ cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type);
+
if (base_header->type == DFSM_MESSAGE_NORMAL) {
dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg;
+ cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)",
+ base_header->type, base_header->subtype, msg_len);
if (msg_len < sizeof(dfsm_message_normal_header_t)) {
cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)",
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback(
} else {
int msg_res = -1;
+ cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)",
+ header->count, base_header->subtype, msg_len);
int res = dfsm->dfsm_callbacks->dfsm_deliver_fn(
dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype,
base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t),
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback(
*/
dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg;
+ cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d",
+ base_header->type, base_header->subtype, msg_len, mode);
if (msg_len < sizeof(dfsm_message_state_header_t)) {
cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)",
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback(
if (mode == DFSM_MODE_SYNCED) {
if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+ cfs_dom_debug(dfsm->log_domain, "received update complete message");
for (int i = 0; i < dfsm->sync_info->node_count; i++)
dfsm->sync_info->nodes[i].synced = 1;
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback(
return;
} else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) {
+ cfs_dom_debug(dfsm->log_domain, "received verify request message");
if (msg_len != sizeof(dfsm->csum_counter)) {
cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid);
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback(
} else if (mode == DFSM_MODE_START_SYNC) {
if (base_header->type == DFSM_MESSAGE_SYNC_START) {
-
if (nodeid != dfsm->lowest_nodeid) {
cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d",
nodeid, pid);
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback(
return;
} else if (base_header->type == DFSM_MESSAGE_STATE) {
+ cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid);
dfsm_node_info_t *ni;
@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback(
goto leave;
}
+ } else {
+ cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more");
}
return;
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback(
} else if (mode == DFSM_MODE_UPDATE) {
if (base_header->type == DFSM_MESSAGE_UPDATE) {
+ cfs_dom_debug(dfsm->log_domain, "received update message");
int res = dfsm->dfsm_callbacks->dfsm_process_update_fn(
dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len);
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback(
} else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+ cfs_dom_debug(dfsm->log_domain, "received update complete message");
int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info);
@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback(
size_t joined_list_entries)
{
cs_error_t result;
+ cfs_debug("dfsm_cpg_confchg_callback called");
dfsm_t *dfsm = NULL;
result = cpg_context_get(handle, (gpointer *)&dfsm);
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel