Date: Mon, 28 Sep 2020 11:17:37 +0200
From: Fabian Grünbichler
To: Proxmox VE development discussion
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Message-Id: <1601282139.yqoafefp96.astroid@nora.none>
In-Reply-To: <2049133658.1264587.1601051377789.JavaMail.zimbra@odiso.com>

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote:
> here a new hang:
> 
> http://odisoweb1.odiso.net/test4/

okay, so at least we now know that something strange is going on inside
pmxcfs, and not inside corosync - we never reach the part where the
broken node (again #13 in this hang!)
sends out the state message:

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state)
[...]
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full)

this should be followed by output like this (taken from the -unblock log,
where the sync went through):

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state)

but this never happens. the relevant code from our patched dfsm.c
(reordered):

static cs_error_t
dfsm_send_state_message_full(
    dfsm_t *dfsm,
    uint16_t type,
    struct iovec *iov,
    unsigned int len)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    // this message is still logged
    cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);

    // everything below this point might not have happened anymore
    dfsm_message_state_header_t header;
    header.base.type = type;
    header.base.subtype = 0;
    header.base.protocol_version = dfsm->protocol_version;
    header.base.time = time(NULL);
    header.base.reserved = 0;
    header.epoch = dfsm->sync_epoch;

    struct iovec real_iov[len + 1];

    real_iov[0].iov_base = (char *)&header;
    real_iov[0].iov_len = sizeof(header);

    for (int i = 0; i < len; i++)
        real_iov[i + 1] = iov[i];

    return type == DFSM_MESSAGE_STATE ?
        dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) :
        dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
}

static cs_error_t
dfsm_send_message_full_debug_state(
    dfsm_t *dfsm,
    struct iovec *iov,
    unsigned int len,
    int retry)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
    cs_error_t result;
    int retries = 0;

    // this message is not visible in log
    // we don't know how far above this we managed to run
    cfs_dom_message(dfsm->log_domain, "send state message debug");
    cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);
    for (int i = 0; i < len; i++)
        cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);

loop:
    cfs_dom_message(dfsm->log_domain, "send state message loop body");
    result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
    cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);

    if (retry && result == CS_ERR_TRY_AGAIN) {
        nanosleep(&tvreq, NULL);
        ++retries;
        if ((retries % 10) == 0)
            cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
        if (retries < 100)
            goto loop;
    }

    if (retries)
        cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

    if (result != CS_OK &&
        (!retry || result != CS_ERR_TRY_AGAIN))
        cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

    return result;
}

I don't see much that could go wrong inside dfsm_send_state_message_full
after the debug log statement - it's just filling out the header and iov
structs, and calling 'dfsm_send_message_full_debug_state', since
type == 2 == DFSM_MESSAGE_STATE.

inside dfsm_send_message_full_debug_state, before the first log statement
we only check for dfsm or the message length/content being NULL/0, all of
which can't really happen with that call path. also, in that case we'd
return CS_ERR_INVALID_PARAM, which would bubble up into the delivery
callback and cause us to leave CPG, which would again be visible in the
logs..

but, just to make sure, could you reproduce the issue once more, and
then (with debug symbols installed) run

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)

on all nodes at the same time (a rough sketch for firing this off on all
nodes in parallel is at the end of this mail)? this should minimize the
fallout and show us whether the thread that logged the first part of
sending the state is still around on the node that triggers the hang..

> Something new: where doing coredump or bt-full on pmxcfs on node1,
> this have unlocked /etc/pve on other nodes
> 
> /etc/pve unlocked(with coredump or bt-full): 17:57:40

this looks like you bt-ed corosync this time around? if so, then this is
expected:

- attach gdb to corosync
- corosync blocks
- other nodes notice corosync is gone on node X
- config change triggered
- sync restarts on all nodes, does not trigger bug this time
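
as mentioned above, a rough sketch of how the gdb command could be run on
all nodes at (roughly) the same time from one host. this is only an
illustration: the node list, passwordless root ssh and the output file
names are assumptions, adjust them to the actual setup:

# placeholder node list - replace with the real cluster node names
NODES="m6kvm1 m6kvm2 m6kvm13"

for node in $NODES; do
    # run gdb remotely; \$ makes $(pidof pmxcfs) expand on the remote node.
    # each ssh is backgrounded so the backtraces are taken roughly in parallel
    ssh "root@$node" \
        "gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p \$(pidof pmxcfs)" \
        > "pmxcfs-bt-$node.txt" 2>&1 &
done
wait    # wait for all nodes before collecting the pmxcfs-bt-*.txt files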