Date: Mon, 28 Sep 2020 17:59:20 +0200 (CEST)
From: Alexandre DERUMIER
To: Proxmox VE development discussion
Message-ID: <260722331.1517115.1601308760989.JavaMail.zimbra@odiso.com>
In-Reply-To: <936571798.1335244.1601285700689.JavaMail.zimbra@odiso.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Here is a new test:

http://odisoweb1.odiso.net/test5

This one occurred at corosync start.

node1:
-----
start corosync: 17:30:19

node2: /etc/pve locked
--------------
current time: 17:30:24

I took a backtrace of all nodes at the same time with parallel ssh at 17:35:22,
and a coredump of all nodes at the same time with parallel ssh at 17:42:26.

(Note that this time, /etc/pve was still locked after the backtrace/coredump.)
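(The exact parallel ssh invocation isn't part of this report; with pssh it
could look roughly like the sketch below. nodes.txt is a hypothetical file
listing all cluster nodes, the output paths are arbitrary, and the gdb
command is the one Fabian suggested further down in the thread.)

    # one-shot backtrace of pmxcfs on every node at (roughly) the same time
    parallel-ssh -i -h nodes.txt "gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p \$(pidof pmxcfs) > /tmp/pmxcfs-bt.txt 2>&1"

    # coredump of pmxcfs on every node (gcore writes /tmp/pmxcfs.<pid>)
    parallel-ssh -i -h nodes.txt "gcore -o /tmp/pmxcfs \$(pidof pmxcfs)"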
----- Original Message -----
From: "aderumier"
To: "Proxmox VE development discussion"
Sent: Monday, September 28, 2020 11:35:00
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>but, just to make sure, could you reproduce the issue once more, and
>>then (with debug symbols installed) run
>>
>>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)
>>
>>on all nodes at the same time? this should minimize the fallout and show
>>us whether the thread that logged the first part of sending the state is
>>still around on the node that triggers the hang..

ok, no problem, I'll do a new test today

>>this looks like you bt-ed corosync this time around? if so, then this is
>>expected:
>>
>>- attach gdb to corosync
>>- corosync blocks
>>- other nodes notice corosync is gone on node X
>>- config change triggered
>>- sync restarts on all nodes, does not trigger the bug this time

ok, thanks, that makes sense. I didn't notice this previously, but this time
it was with corosync started.

----- Original Message -----
From: "Fabian Grünbichler"
To: "Proxmox VE development discussion"
Sent: Monday, September 28, 2020 11:17:37
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote:
> here a new hang:
>
> http://odisoweb1.odiso.net/test4/

okay, so at least we now know something strange inside pmxcfs is going
on, and not inside corosync - we never reach the part where the broken
node (again #13 in this hang!)
sends out the state message:

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state)
[...]
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full)

this should be followed by output like this (from the -unblock log,
where the sync went through):

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state)

but this never happens. the relevant code from our patched dfsm.c (reordered):

static cs_error_t
dfsm_send_state_message_full(
    dfsm_t *dfsm,
    uint16_t type,
    struct iovec *iov,
    unsigned int len)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    // this message is still logged
    cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);

    // everything below this point might not have happened anymore
    dfsm_message_state_header_t header;
    header.base.type = type;
    header.base.subtype = 0;
    header.base.protocol_version = dfsm->protocol_version;
    header.base.time = time(NULL);
    header.base.reserved = 0;
    header.epoch = dfsm->sync_epoch;

    struct iovec real_iov[len + 1];

    real_iov[0].iov_base = (char *)&header;
    real_iov[0].iov_len = sizeof(header);

    for (int i = 0; i < len; i++)
        real_iov[i + 1] = iov[i];

    return type == DFSM_MESSAGE_STATE ?
        dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) :
        dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
}

static cs_error_t
dfsm_send_message_full_debug_state(
    dfsm_t *dfsm,
    struct iovec *iov,
    unsigned int len,
    int retry)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
    cs_error_t result;
    int retries = 0;

    // this message is not visible in log
    // we don't know how far above this we managed to run
    cfs_dom_message(dfsm->log_domain, "send state message debug");
    cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);
    for (int i = 0; i < len; i++)
        cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);

loop:
    cfs_dom_message(dfsm->log_domain, "send state message loop body");

    result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);

    cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);

    if (retry && result == CS_ERR_TRY_AGAIN) {
        nanosleep(&tvreq, NULL);
        ++retries;
        if ((retries % 10) == 0)
            cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
        if (retries < 100)
            goto loop;
    }

    if (retries)
        cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

    if (result != CS_OK &&
        (!retry || result != CS_ERR_TRY_AGAIN))
        cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

    return result;
}

I don't see much that could go wrong inside dfsm_send_state_message_full
after the debug log statement (it's just filling out the header and iov
structs, and calling 'dfsm_send_message_full_debug_state' since type ==
2 == DFSM_MESSAGE_STATE).

inside dfsm_send_message_full_debug_state, before the first log statement
we only check for dfsm or the message length/content being NULL/0, all of
which can't really happen with that call path. also, in that case we'd
return CS_ERR_INVALID_PARAM, which would bubble up into the delivery
callback and cause us to leave CPG, which would again be visible in the
logs..

but, just to make sure, could you reproduce the issue once more, and
then (with debug symbols installed) run

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)

on all nodes at the same time? this should minimize the fallout and show
us whether the thread that logged the first part of sending the state is
still around on the node that triggers the hang..

> Something new: when doing a coredump or bt-full on pmxcfs on node1,
> this has unlocked /etc/pve on the other nodes
>
> /etc/pve unlocked (with coredump or bt-full): 17:57:40

this looks like you bt-ed corosync this time around? if so, then this is
expected:

- attach gdb to corosync
- corosync blocks
- other nodes notice corosync is gone on node X
- config change triggered
- sync restarts on all nodes, does not trigger the bug this time
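(To spell out the gdb side of that sequence - this is only an illustration,
not a command from the original thread: a one-shot batch backtrace detaches
after a fraction of a second, while an interactive attach or a coredump keeps
corosync stopped until you let it continue. Whether even the short stop stays
below the token timeout depends on the cluster's corosync settings.)

    # one-shot backtrace: corosync is only stopped while the threads are
    # dumped, then gdb detaches automatically and the process resumes
    gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof corosync)

    # interactive attach: corosync stays stopped at the gdb prompt, so the
    # other nodes eventually see it as gone and a config change is triggered
    gdb -p $(pidof corosync)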