From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <aderumier@odiso.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 1ECD361E5F
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:23 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 129822FF50
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:23 +0200 (CEST)
Received: from mailpro.odiso.net (mailpro.odiso.net [89.248.211.110])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 9DBD82FF46
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
Received: from localhost (localhost [127.0.0.1])
 by mailpro.odiso.net (Postfix) with ESMTP id 718611C6F4AF
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
Received: from mailpro.odiso.net ([127.0.0.1])
 by localhost (mailpro.odiso.net [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id kOyDGR7r01Kv for <pve-devel@lists.proxmox.com>;
 Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
Received: from localhost (localhost [127.0.0.1])
 by mailpro.odiso.net (Postfix) with ESMTP id 56C4F1C6F4B0
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
X-Virus-Scanned: amavisd-new at mailpro.odiso.com
Received: from mailpro.odiso.net ([127.0.0.1])
 by localhost (mailpro.odiso.net [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id kDWQkys3Lnvu for <pve-devel@lists.proxmox.com>;
 Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
Received: from mailpro.odiso.net (mailpro.odiso.net [10.1.31.111])
 by mailpro.odiso.net (Postfix) with ESMTP id 39E361C6F4AF
 for <pve-devel@lists.proxmox.com>; Mon, 28 Sep 2020 17:59:21 +0200 (CEST)
Date: Mon, 28 Sep 2020 17:59:20 +0200 (CEST)
From: Alexandre DERUMIER <aderumier@odiso.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <260722331.1517115.1601308760989.JavaMail.zimbra@odiso.com>
In-Reply-To: <936571798.1335244.1601285700689.JavaMail.zimbra@odiso.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <1264529857.1248647.1601018149719.JavaMail.zimbra@odiso.com>
 <1601024991.2yoxd1np1v.astroid@nora.none>
 <1157671072.1253096.1601027205997.JavaMail.zimbra@odiso.com>
 <1601037918.lwca57m6tz.astroid@nora.none>
 <2049133658.1264587.1601051377789.JavaMail.zimbra@odiso.com>
 <1601282139.yqoafefp96.astroid@nora.none>
 <936571798.1335244.1601285700689.JavaMail.zimbra@odiso.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Mailer: Zimbra 8.8.12_GA_3866 (ZimbraWebClient - GC83 (Linux)/8.8.12_GA_3844)
Thread-Topic: corosync bug: cluster break after 1 node clean shutdown
Thread-Index: bhZJkOyyn3NfaAhhkTRk6Oxo5rrBMaaUXE8f
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.100 Adjusted score from AWL reputation of From: address
 KAM_ASCII_DIVIDERS        0.8 Spam that uses ascii formatting tricks
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 RCVD_IN_DNSWL_NONE     -0.0001 Sender listed at https://www.dnswl.org/,
 no trust
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
 shutdown
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Mon, 28 Sep 2020 15:59:23 -0000

Here is a new test: http://odisoweb1.odiso.net/test5

This occurred at corosync start.


node1:
-----
start corosync: 17:30:19


node2: /etc/pve locked
--------------
Current time: 17:30:24


I took backtraces of all nodes at the same time with parallel ssh at 17:35:22

and coredumps of all nodes at the same time with parallel ssh at 17:42:26


(Note that this time, /etc/pve was still locked after backtrace/coredump)




----- Original Message -----
From: "aderumier" <aderumier@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 28 September 2020 11:35:00
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>but, just to make sure, could you reproduce the issue once more, and
>>then (with debug symbols installed) run
>>
>>$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)
>>
>>on all nodes at the same time? this should minimize the fallout and show
>>us whether the thread that logged the first part of sending the state is
>>still around on the node that triggers the hang..

ok, no problem, I'll do a new test today



>>this looks like you bt-ed corosync this time around? if so, then this is
>>expected:
>>
>>- attach gdb to corosync
>>- corosync blocks
>>- other nodes notice corosync is gone on node X
>>- config change triggered
>>- sync restarts on all nodes, does not trigger bug this time

ok, thanks, that makes sense. I didn't notice this previously, but this time it was with corosync started.


----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 28 September 2020 11:17:37
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 25, 2020 6:29 pm, Alexandre DERUMIER wrote:
> here a new hang:
>
> http://odisoweb1.odiso.net/test4/

okay, so at least we now know something strange inside pmxcfs is going
on, and not inside corosync - we never reach the part where the broken
node (again #13 in this hang!) sends out the state message:

Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm mode is 1 (dfsm.c:706:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received message's header type is 1 (dfsm.c:736:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: received state message (type = 1, subtype = 0, 32 bytes), mode is 1 (dfsm.c:782:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] notice: received sync request (epoch 1/23389/000000A8) (dfsm.c:890:dfsm_cpg_deliver_callback)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [main] debug: enter dfsm_release_sync_resources (dfsm.c:640:dfsm_release_sync_resources)
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: enter dcdb_get_state 00000000365345C2 5F6E0B1F (dcdb.c:444:dcdb_get_state)
[...]
Sep 25 17:22:08 m6kvm13 pmxcfs[10086]: [dcdb] debug: dfsm_send_state_message_full: type 2 len 1 (dfsm.c:281:dfsm_send_state_message_full)

this should be followed by output like this (from the -unblock log,
where the sync went through):

Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message debug (dfsm.c:241:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: cpg_handle 5109470997161967616 (dfsm.c:242:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[0] len 32 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: iov[1] len 64792 (dfsm.c:244:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message loop body (dfsm.c:246:dfsm_send_message_full_debug_state)
Sep 25 17:57:39 m6kvm13 pmxcfs[10086]: [dcdb] notice: send state message result: 1 (dfsm.c:250:dfsm_send_message_full_debug_state)

but this never happens. the relevant code from our patched dfsm.c
(reordered):

static cs_error_t
dfsm_send_state_message_full(
    dfsm_t *dfsm,
    uint16_t type,
    struct iovec *iov,
    unsigned int len)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(DFSM_VALID_STATE_MESSAGE(type), CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    // this message is still logged
    cfs_dom_debug(dfsm->log_domain, "dfsm_send_state_message_full: type %d len %d", type, len);

    // everything below this point might not have happened anymore

    dfsm_message_state_header_t header;
    header.base.type = type;
    header.base.subtype = 0;
    header.base.protocol_version = dfsm->protocol_version;
    header.base.time = time(NULL);
    header.base.reserved = 0;

    header.epoch = dfsm->sync_epoch;

    struct iovec real_iov[len + 1];

    real_iov[0].iov_base = (char *)&header;
    real_iov[0].iov_len = sizeof(header);

    for (int i = 0; i < len; i++)
        real_iov[i + 1] = iov[i];

    return type == DFSM_MESSAGE_STATE
        ? dfsm_send_message_full_debug_state(dfsm, real_iov, len + 1, 1)
        : dfsm_send_message_full(dfsm, real_iov, len + 1, 1);
}

static cs_error_t
dfsm_send_message_full_debug_state(
    dfsm_t *dfsm,
    struct iovec *iov,
    unsigned int len,
    int retry)
{
    g_return_val_if_fail(dfsm != NULL, CS_ERR_INVALID_PARAM);
    g_return_val_if_fail(!len || iov != NULL, CS_ERR_INVALID_PARAM);

    struct timespec tvreq = { .tv_sec = 0, .tv_nsec = 100000000 };
    cs_error_t result;
    int retries = 0;

    // this message is not visible in log
    // we don't know how far above this we managed to run
    cfs_dom_message(dfsm->log_domain, "send state message debug");
    cfs_dom_message(dfsm->log_domain, "cpg_handle %" PRIu64, dfsm->cpg_handle);

    for (int i = 0; i < len; i++)
        cfs_dom_message(dfsm->log_domain, "iov[%d] len %zd", i, iov[i].iov_len);

loop:
    cfs_dom_message(dfsm->log_domain, "send state message loop body");

    result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);

    cfs_dom_message(dfsm->log_domain, "send state message result: %d", result);

    if (retry && result == CS_ERR_TRY_AGAIN) {
        nanosleep(&tvreq, NULL);
        ++retries;
        if ((retries % 10) == 0)
            cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
        if (retries < 100)
            goto loop;
    }

    if (retries)
        cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);

    if (result != CS_OK &&
        (!retry || result != CS_ERR_TRY_AGAIN))
        cfs_dom_critical(dfsm->log_domain, "cpg_send_message failed: %d", result);

    return result;
}

I don't see much that could go wrong inside dfsm_send_state_message_full
after the debug log statement (it's just filling out the header and iov
structs, and calling 'dfsm_send_message_full_debug_state' since type == 2 == DFSM_MESSAGE_STATE).

inside dfsm_send_message_full_debug_state, before the first log statement we
only check for dfsm or the message length/content being NULL/0, all of
which can't really happen with that call path. also, in that case we'd
return CS_ERR_INVALID_PARAM, which would bubble up into the delivery
callback and cause us to leave CPG, which would again be visible in the
logs..
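
to make that path concrete, here is a minimal sketch of the propagation (this is not the actual dfsm.c callback code; the body and the cpg_group_name field used for cpg_leave are illustrative assumptions) - an early CS_ERR_INVALID_PARAM would be logged and would end with the node leaving the CPG group, neither of which we see in the logs:

/*
 * illustrative sketch only (not copied from dfsm.c): if the send path
 * bailed out early, the deliver callback would see a non-OK result,
 * log it, and leave the CPG group - both clearly visible in the logs.
 */
static void
deliver_callback_error_path_sketch(dfsm_t *dfsm, struct iovec *iov, unsigned int len)
{
    cs_error_t result = dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_STATE, iov, len);

    if (result != CS_OK) {
        cfs_dom_critical(dfsm->log_domain, "failed to send state message: %d", result);
        // 'cpg_group_name' is an assumed field name, used here only to
        // show the cpg_leave() call that would make the failure visible
        cpg_leave(dfsm->cpg_handle, &dfsm->cpg_group_name);
    }
}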

but, just to make sure, could you reproduce the issue once more, and
then (with debug symbols installed) run

$ gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof pmxcfs)

on all nodes at the same time? this should minimize the fallout and show
us whether the thread that logged the first part of sending the state is
still around on the node that triggers the hang..

> Something new: when doing a coredump or bt-full on pmxcfs on node1,
> this unlocked /etc/pve on the other nodes
>
> /etc/pve unlocked (with coredump or bt-full): 17:57:40

this looks like you bt-ed corosync this time around? if so, then this is
expected:

- attach gdb to corosync
- corosync blocks
- other nodes notice corosync is gone on node X
- config change triggered
- sync restarts on all nodes, does not trigger bug this time (see the sketch below)
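
for reference, this is roughly how the other nodes observe the blocked corosync: the membership change is delivered through the CPG configuration-change callback, with node X in left_list, and that is what kicks off a new sync round. a small sketch against the libcpg API (the callback body and the printf are illustrative, not pmxcfs code):

#include <stdio.h>
#include <corosync/cpg.h>

/*
 * illustrative sketch: when corosync on one node stops dispatching (e.g.
 * frozen under gdb), the remaining members get a configuration change with
 * that node in left_list, and the application starts a new sync round.
 */
static void confchg_sketch(
    cpg_handle_t handle,
    const struct cpg_name *group_name,
    const struct cpg_address *member_list, size_t member_list_entries,
    const struct cpg_address *left_list, size_t left_list_entries,
    const struct cpg_address *joined_list, size_t joined_list_entries)
{
    for (size_t i = 0; i < left_list_entries; i++)
        printf("node %u left group %.*s - starting a new state sync\n",
               left_list[i].nodeid,
               (int)group_name->length, group_name->value);
}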



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel