Date: Mon, 28 Sep 2020 11:17:37 +0200
From: Fabian Grünbichler <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <448071987.1233936.1600957757611.JavaMail.zimbra@odiso.com>
 <906463192.1239798.1600970863982.JavaMail.zimbra@odiso.com>
 <2019315449.1247737.1601016264202.JavaMail.zimbra@odiso.com>
 <1264529857.1248647.1601018149719.JavaMail.zimbra@odiso.com>
 <1601024991.2yoxd1np1v.astroid@nora.none>
 <1157671072.1253096.1601027205997.JavaMail.zimbra@odiso.com>
 <1601037918.lwca57m6tz.astroid@nora.none>
 <2049133658.1264587.1601051377789.JavaMail.zimbra@odiso.com>
In-Reply-To: <2049133658.1264587.1601051377789.JavaMail.zimbra@odiso.com>
User-Agent: astroid/0.15.0 (https://github.com/astroidmail/astroid)
Message-Id: <1601282139.yqoafefp96.astroid@nora.none>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote:
> here a new hang:
>
> http://odisoweb1.odiso.net/test4/

okay, so at least we now know something strange is going on inside pmxcfs,
and not inside corosync - we never reach the part where the broken node
(again #13 in this hang!) sends out the state message:

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state)
[...]
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full)

this should be followed by output like this (from the -unblock log,
where the sync went through):

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state)

but this never happens. the relevant code from our patched dfsm.c
(reordered):

static cs_error_t
dfsm_send_state_message_full(
	dfsm_t *dfsm,
	uint16_t type,
	struct iovec *iov,
	unsigned int len)
{
	g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

// this message is still logged
	cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);

// everything below this point might not have happened anymore

	dfsm_message_state_header_t header;
	header.base.type = type;
	header.base.subtype = 0;
	header.base.protocol_version = dfsm->protocol_version;
	header.base.time = time(NULL);
	header.base.reserved = 0;

	header.epoch = dfsm->sync_epoch;

	struct iovec real_iov[len + 1];

	real_iov[0].iov_base = (char *)&header;
	real_iov[0].iov_len = sizeof(header);

	for (int i = 0; i < len; i++)
		real_iov[i + 1] = iov[i];

	return type == DFSM_MESSAGE_STATE ?
		dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1) :
		dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
}

static cs_error_t
dfsm_send_message_full_debug_state(
	dfsm_t *dfsm,
	struct iovec *iov,
	unsigned int len,
	int retry)
{
	g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
	g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

	struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
	cs_error_t result;
	int retries = 0;

// this message is not visible in log
// we don't know how far above this we managed to run
	cfs_dom_message(dfsm->log_domain, "send state message debug");
	cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);
	for (int i = 0; i < len; i++)
		cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);
loop:
	cfs_dom_message(dfsm->log_domain, "send state message loop body");

	result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);

	cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);
	if (retry && result == CS_ERR_TRY_AGAIN) {
		nanosleep(&tvreq, NULL);
		++retries;
		if ((retries % 10) == 0)
			cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
		if (retries < 100)
			goto loop;
	}

	if (retries)
		cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

	if (result != CS_OK &&
	    (!retry || result != CS_ERR_TRY_AGAIN))
		cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

	return result;
}

I don't see much that could go wrong inside dfsm_send_state_message_full
after the debug log statement (it's just filling out the header and iov
structs, and calling 'dfsm_send_message_full_debug_state', since
type == 2 == DFSM_MESSAGE_STATE).

inside dfsm_send_message_full_debug_state, before the first log statement
we only check for dfsm or the message length/content being NULL/0, none of
which can really happen with that call path. also, in that case we'd return
CS_ERR_INVALID_PARAM, which would bubble up into the delivery callback and
cause us to leave CPG, which would again be visible in the logs..
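
just to illustrate the failure path I mean - this is only a rough sketch,
not the actual dfsm.c error handling, and the function name below is made
up:

#include <corosync/cpg.h>
#include <stdio.h>

/* sketch: handling a failed state-message send - the error gets logged
 * and we leave the CPG group, and both of those would show up in the
 * logs on the affected node */
static void
handle_send_result(cpg_handle_t handle, const struct cpg_name *group, cs_error_t result)
{
	if (result == CS_OK)
		return;

	/* equivalent of a cfs_dom_critical() - visible in the journal */
	fprintf(stderr, "sending state message failed: %d, leaving CPG group\n", result);

	/* leaving the group triggers a config change on all members,
	 * which is logged on every node as well */
	cpg_leave(handle, group);
}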

but, just to make sure, could you reproduce the issue once more, and
then (with debug symbols installed) run

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)

on all nodes at the same time? this should minimize the fallout and show
us whether the thread that logged the first part of sending the state is
still around on the node that triggers the hang.

> Something new: where doing coredump or bt-full on pmxcfs on node1,
> this have unlocked /etc/pve on other nodes
>
> /etc/pve unlocked(with coredump or bt-full): 17:57:40

this looks like you bt-ed corosync this time around? if so, then this is
expected (rough sketch of the CPG side below):

- attach gdb to corosync
- corosync blocks
- other nodes notice corosync is gone on node X
- config change triggered
- sync restarts on all nodes, does not trigger bug this time
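
for reference, a minimal sketch of the CPG mechanism behind this - pmxcfs
reacts to membership changes in its CPG confchg callback in dfsm.c; the
callback name and body below are illustrative only:

#include <corosync/cpg.h>
#include <stdio.h>

/* sketch: every surviving member gets the confchg callback when a node's
 * corosync stops participating (e.g. because it is frozen under gdb) */
static void
example_confchg_cb(
	cpg_handle_t handle,
	const struct cpg_name *group_name,
	const struct cpg_address *member_list, size_t member_list_entries,
	const struct cpg_address *left_list, size_t left_list_entries,
	const struct cpg_address *joined_list, size_t joined_list_entries)
{
	/* the frozen node shows up in left_list ... */
	for (size_t i = 0; i < left_list_entries; i++)
		fprintf(stderr, "node %u left the group\n", left_list[i].nodeid);

	/* ... and the state machine then restarts the sync with the
	 * remaining members, which is why /etc/pve got unblocked */
}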