Date: Thu, 24 Sep 2020 16:29:17 +0200 (CEST)
From: Alexandre DERUMIER
To: Proxmox VE development discussion
Message-ID: <448071987.1233936.1600957757611.JavaMail.zimbra@odiso.com>
In-Reply-To: <1600955976.my09kyug7a.astroid@nora.none>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi Fabian,

>> if you are still able to test, it would be great if you could give the
>> following packages a spin (they only contain some extra debug prints
>> on message processing/sending):

Sure, no problem, I'm going to test it tonight.

>> ideally, you could get the debug logs from all nodes, and the
>> coredump/bt from the node where pmxcfs hangs. thanks!

OK, no problem. I'll keep you posted tomorrow.
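For the bt, I'll probably just attach gdb to the hung pmxcfs, something like "gdb -p $(pidof pmxcfs)" followed by "thread apply all bt full" (and "generate-core-file" to grab the coredump as well), assuming the dbgsym package from your link gives usable symbols.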
----- Original Message -----
From: "Fabian Grünbichler"
To: "Proxmox VE development discussion"
Sent: Thursday, 24 September 2020 16:02:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 22, 2020 7:43 am, Alexandre DERUMIER wrote:
> I have done a test with "kill -9", but after that it became available again.
>
> So, it's really something when corosync is in its shutdown phase and pmxcfs is still running.
>
> So, for now, as a workaround, I have changed
>
> /lib/systemd/system/pve-cluster.service
>
> #Wants=corosync.service
> #Before=corosync.service
> Requires=corosync.service
> After=corosync.service
>
> This way, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first.

if you are still able to test, it would be great if you could give the
following packages a spin (they only contain some extra debug prints
on message processing/sending):

http://download.proxmox.com/temp/pmxcfs-dbg/

64eb9dbd2f60fe319abaeece89a84fd5de1f05a8c38cb9871058ec1f55025486ec15b7c0053976159fe5c2518615fd80084925abf4d3a5061ea7d6edef264c36 pve-cluster_6.1-8_amd64.deb
04b557c7f0dc1aa2846b534d6afab70c2b8d4720ac307364e015d885e2e997b6dcaa54cad673b22d626d27cb053e5723510fde15d078d5fe1f262fc5486e6cef pve-cluster-dbgsym_6.1-8_amd64.deb

ideally, you could get the debug logs from all nodes, and the
coredump/bt from the node where pmxcfs hangs.

thanks!

diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..e0bd93f 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -162,8 +162,8 @@ static void
 dfsm_send_sync_message_abort(dfsm_t *dfsm)
 {
     g_return_if_fail(dfsm != NULL);
-
     g_mutex_lock (&dfsm->sync_mutex);
+    cfs_dom_debug(dfsm->log_domain, "dfsm_send_sync_message_abort - %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, dfsm->msgcount);
     dfsm->msgcount_rcvd = dfsm->msgcount;
     g_cond_broadcast (&dfsm->sync_cond);
     g_mutex_unlock (&dfsm->sync_mutex);
@@ -181,6 +181,7 @@ dfsm_record_local_result(
     g_mutex_lock (&dfsm->sync_mutex);
     dfsm_result_t *rp = (dfsm_result_t *)g_hash_table_lookup(dfsm->results, &msg_count);
+    cfs_dom_debug(dfsm->log_domain, "dfsm_record_local_result - %" PRIu64 ": %d", msg_count, msg_result);
     if (rp) {
         rp->result = msg_result;
         rp->processed = processed;
@@ -235,6 +236,8 @@ dfsm_send_state_message_full(
     g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
     g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);
+    cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);
+
     dfsm_message_state_header_t header;
     header.base.type = type;
     header.base.subtype = 0;
@@ -317,6 +320,7 @@ dfsm_send_message_sync(
     for (int i = 0; i < len; i++)
         real_iov[i + 1] = iov[i];
+    cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: type NORMAL, msgtype %d, len %d", msgtype, len);
     cs_error_t result = dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
     g_mutex_unlock (&dfsm->sync_mutex);
@@ -335,10 +339,12 @@ dfsm_send_message_sync(
     if (rp) {
         g_mutex_lock (&dfsm->sync_mutex);
-        while (dfsm->msgcount_rcvd < msgcount)
+        while (dfsm->msgcount_rcvd < msgcount) {
+            cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: waiting for received messages %" PRIu64 " / %" PRIu64, dfsm->msgcount_rcvd, msgcount);
             g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);
+        }
+        cfs_dom_debug(dfsm->log_domain, "dfsm_send_message_sync: done waiting for received messages!");
-
         g_hash_table_remove(dfsm->results, &rp->msgcount);
         g_mutex_unlock (&dfsm->sync_mutex);
@@ -685,9 +691,13 @@ dfsm_cpg_deliver_callback(
         return;
     }
+    cfs_dom_debug(dfsm->log_domain, "received message's header type is %d", base_header->type);
+
     if (base_header->type == DFSM_MESSAGE_NORMAL) {
         dfsm_message_normal_header_t *header = (dfsm_message_normal_header_t *)msg;
+        cfs_dom_debug(dfsm->log_domain, "received normal message (type = %d, subtype = %d, %zd bytes)",
+            base_header->type, base_header->subtype, msg_len);
         if (msg_len < sizeof(dfsm_message_normal_header_t)) {
             cfs_dom_critical(dfsm->log_domain, "received short message (type = %d, subtype = %d, %zd bytes)",
@@ -704,6 +714,8 @@ dfsm_cpg_deliver_callback(
         } else {
             int msg_res = -1;
+            cfs_dom_debug(dfsm->log_domain, "deliver message %" PRIu64 " (subtype = %d, length = %zd)",
+                header->count, base_header->subtype, msg_len);
             int res = dfsm->dfsm_callbacks->dfsm_deliver_fn(
                 dfsm, dfsm->data, &msg_res, nodeid, pid, base_header->subtype,
                 base_header->time, (uint8_t *)msg + sizeof(dfsm_message_normal_header_t),
@@ -724,6 +736,8 @@ dfsm_cpg_deliver_callback(
      */
     dfsm_message_state_header_t *header = (dfsm_message_state_header_t *)msg;
+    cfs_dom_debug(dfsm->log_domain, "received state message (type = %d, subtype = %d, %zd bytes), mode is %d",
+        base_header->type, base_header->subtype, msg_len, mode);
     if (msg_len < sizeof(dfsm_message_state_header_t)) {
         cfs_dom_critical(dfsm->log_domain, "received short state message (type = %d, subtype = %d, %zd bytes)",
@@ -744,6 +758,7 @@ dfsm_cpg_deliver_callback(
     if (mode == DFSM_MODE_SYNCED) {
         if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+            cfs_dom_debug(dfsm->log_domain, "received update complete message");
             for (int i = 0; i < dfsm->sync_info->node_count; i++)
                 dfsm->sync_info->nodes[i].synced = 1;
@@ -754,6 +769,7 @@ dfsm_cpg_deliver_callback(
             return;
         } else if (base_header->type == DFSM_MESSAGE_VERIFY_REQUEST) {
+            cfs_dom_debug(dfsm->log_domain, "received verify request message");
             if (msg_len != sizeof(dfsm->csum_counter)) {
                 cfs_dom_critical(dfsm->log_domain, "cpg received verify request with wrong length (%zd bytes) form node %d/%d", msg_len, nodeid, pid);
@@ -823,7 +839,6 @@ dfsm_cpg_deliver_callback(
     } else if (mode == DFSM_MODE_START_SYNC) {
         if (base_header->type == DFSM_MESSAGE_SYNC_START) {
-
             if (nodeid != dfsm->lowest_nodeid) {
                 cfs_dom_critical(dfsm->log_domain, "ignore sync request from wrong member %d/%d",
                     nodeid, pid);
@@ -861,6 +876,7 @@ dfsm_cpg_deliver_callback(
             return;
         } else if (base_header->type == DFSM_MESSAGE_STATE) {
+            cfs_dom_debug(dfsm->log_domain, "received state message for %d/%d", nodeid, pid);
             dfsm_node_info_t *ni;
@@ -906,6 +922,8 @@ dfsm_cpg_deliver_callback(
                     goto leave;
                 }
+            } else {
+                cfs_dom_debug(dfsm->log_domain, "haven't received all states, waiting for more");
             }
             return;
@@ -914,6 +932,7 @@ dfsm_cpg_deliver_callback(
     } else if (mode == DFSM_MODE_UPDATE) {
         if (base_header->type == DFSM_MESSAGE_UPDATE) {
+            cfs_dom_debug(dfsm->log_domain, "received update message");
             int res = dfsm->dfsm_callbacks->dfsm_process_update_fn(
                 dfsm, dfsm->data, dfsm->sync_info, nodeid, pid, msg, msg_len);
@@ -925,6 +944,7 @@ dfsm_cpg_deliver_callback(
         } else if (base_header->type == DFSM_MESSAGE_UPDATE_COMPLETE) {
+            cfs_dom_debug(dfsm->log_domain, "received update complete message");
             int res = dfsm->dfsm_callbacks->dfsm_commit_fn(dfsm, dfsm->data, dfsm->sync_info);
@@ -1104,6 +1124,7 @@ dfsm_cpg_confchg_callback(
     size_t joined_list_entries)
 {
     cs_error_t result;
+    cfs_debug("dfsm_cpg_confchg_callback called");
     dfsm_t *dfsm = NULL;
     result = cpg_context_get(handle, (gpointer *)&dfsm);
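The loop instrumented in dfsm_send_message_sync() above is a standard GLib condition-variable wait: the sender blocks on dfsm->sync_cond until msgcount_rcvd catches up with the message it just sent, and only the CPG deliver path (or dfsm_send_sync_message_abort(), which force-advances the counter) can wake it. If corosync goes away mid-send so that neither ever runs, the writer blocks forever, which would be consistent with the hang being chased in this thread. Below is a minimal standalone sketch of that pattern; the names, counters, and thread here are simplified stand-ins, not the actual pmxcfs code. Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)

#include <glib.h>
#include <stdio.h>

static GMutex sync_mutex;
static GCond sync_cond;
static guint64 msgcount_rcvd = 0; /* stand-in for dfsm->msgcount_rcvd */

/* Stand-in for the CPG deliver callback: confirms one message at a time. */
static gpointer deliver_thread(gpointer data)
{
    guint64 target = GPOINTER_TO_SIZE(data);
    for (guint64 i = 1; i <= target; i++) {
        g_usleep(100 * 1000);           /* simulate delivery latency */
        g_mutex_lock(&sync_mutex);
        msgcount_rcvd = i;              /* record the delivered message */
        g_cond_broadcast(&sync_cond);   /* wake the waiting sender */
        g_mutex_unlock(&sync_mutex);
    }
    return NULL;
}

int main(void)
{
    guint64 msgcount = 5;               /* count of the message we "sent" */
    GThread *t = g_thread_new("deliver", deliver_thread,
                              GSIZE_TO_POINTER(msgcount));

    g_mutex_lock(&sync_mutex);
    /* g_cond_wait() can wake spuriously, hence the predicate loop.
     * If no other thread ever bumps msgcount_rcvd (and nothing calls
     * the abort path), this wait never returns: the hang scenario. */
    while (msgcount_rcvd < msgcount)
        g_cond_wait(&sync_cond, &sync_mutex);
    g_mutex_unlock(&sync_mutex);

    printf("all %" G_GUINT64_FORMAT " messages confirmed\n", msgcount);
    g_thread_join(t);
    return 0;
}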
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel