From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <t.lamprecht@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 5B565607CE
 for <pve-devel@lists.proxmox.com>; Wed,  9 Sep 2020 22:06:22 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 442E714AFB
 for <pve-devel@lists.proxmox.com>; Wed,  9 Sep 2020 22:05:52 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [212.186.127.180])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id CBAC014AEE
 for <pve-devel@lists.proxmox.com>; Wed,  9 Sep 2020 22:05:50 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 96BB544AD4;
 Wed,  9 Sep 2020 22:05:50 +0200 (CEST)
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 Alexandre DERUMIER <aderumier@odiso.com>, dietmar <dietmar@proxmox.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <1059698258.392627.1599381816979.JavaMail.zimbra@odiso.com>
 <508219973.775.1599394447359@webmail.proxmox.com>
 <1661182651.406890.1599463180810.JavaMail.zimbra@odiso.com>
 <72727125.827.1599466723564@webmail.proxmox.com>
 <1066029576.414316.1599471133463.JavaMail.zimbra@odiso.com>
 <872332597.423950.1599485006085.JavaMail.zimbra@odiso.com>
 <1551800621.910.1599540071310@webmail.proxmox.com>
 <1680829869.439013.1599549082330.JavaMail.zimbra@odiso.com>
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
Message-ID: <e80f1080-253d-c43c-4402-258855bcbf18@proxmox.com>
Date: Wed, 9 Sep 2020 22:05:49 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101
 Thunderbird/81.0
MIME-Version: 1.0
In-Reply-To: <1680829869.439013.1599549082330.JavaMail.zimbra@odiso.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.630 Adjusted score from AWL reputation of From: address
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -1.626 Looks like a legit reply (A)
 RCVD_IN_DNSWL_MED        -2.3 Sender listed at https://www.dnswl.org/,
 medium trust
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
 shutdown
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Wed, 09 Sep 2020 20:06:22 -0000

On 08.09.20 09:11, Alexandre DERUMIER wrote:
>>> It would really help if we can reproduce the bug somehow. Do you have an idea how
>>> to trigger the bug?
>
> I really don't know. I'm currently trying to reproduce on the same cluster, with softdog && noboot=1, and rebooting a node.
>
>
> Maybe it's related to the number of VMs, or the number of nodes, I don't have any clue ...

I checked the watchdog code a bit, both our user-space mux and the kernel drivers,
and I'm just noting a few things here (thinking out loud):

The /dev/watchdog itself is always active, else we could lose it to some
other program and not be able to activate HA dynamically.
But as long as no HA service is active, it's a simple dummy "wake up every
second and do an ioctl keep-alive update".
This is really simple and efficiently written, so if that fails for over 10s
the system is really loaded, probably barely responding to anything.
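
For reference, roughly what that dummy loop boils down to; just an illustrative
sketch using the standard Linux watchdog ioctl API (linux/watchdog.h), not the
actual watchdog-mux source:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    /* open the device once; from now on a missed keep-alive within the
     * watchdog timeout (10s in our setup) resets the node */
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("open /dev/watchdog");
        return 1;
    }
    for (;;) {
        /* standard keep-alive ping */
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) == -1)
            perror("WDIOC_KEEPALIVE");
        sleep(1);
    }
}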

Currently the watchdog-mux runs as a normal process, no re-nice, no real-time
scheduling. This is IMO wrong, as it is a critical process which needs to run
with high priority. I have a patch here which sets it to the highest RR
real-time scheduling priority available, effectively the same as what corosync does.


diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 818ae00..71981d7 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -8,2 +8,3 @@
 #include <time.h>
+#include <sched.h>
 #include <sys/ioctl.h>
@@ -151,2 +177,15 @@ main(void)
 
+    int sched_priority = sched_get_priority_max (SCHED_RR);
+    if (sched_priority != -1) {
+        struct sched_param global_sched_param;
+        global_sched_param.sched_priority = sched_priority;
+        int res = sched_setscheduler (0, SCHED_RR, &global_sched_param);
+        if (res == -1) {
+            fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority);
+        } else {
+            fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority);
+        }
+    }
+
+
     if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) {

The issue of watchdog resets without HA, caused by a massively overloaded system,
should already be largely avoided by the scheduling change alone.

Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA active.
This *could* come from a side effect like Ceph rebalancing kicking off and producing
a load spike for >10s, hindering the scheduling of the watchdog-mux.
This is a theory, but with HA off it needs to be something like that, as in the HA-off
case there's *no* direct or indirect connection between corosync/pmxcfs and the
watchdog-mux. It simply does not care about, or notice, quorum partition changes at all.


There may be an approach to reserve the watchdog for the mux while avoiding having
it as a "ticking time bomb":
Theoretically one could open it, then disable it with an ioctl (whether a driver
supports that can be queried) and only enable it for real once the first client
connects to the mux, roughly as sketched below. This may not work for all watchdog
modules, and if so, we may want to make it configurable, as some people actually
want a reset if a (future) real-time process cannot be scheduled for >= 10 seconds.
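
A rough sketch of that idea, assuming the driver honors WDIOS_DISABLECARD /
WDIOS_ENABLECARD via WDIOC_SETOPTIONS (which has to be verified per module,
e.g. by checking the ioctl's return value):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* keep the fd open so no other program can grab /dev/watchdog, but only
 * arm the timer once the first HA client connects to the mux */
static int watchdog_set_armed(int fd, int armed)
{
    int flags = armed ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;
    if (ioctl(fd, WDIOC_SETOPTIONS, &flags) == -1) {
        perror("WDIOC_SETOPTIONS");
        return -1; /* driver may not support disabling at all */
    }
    return 0;
}

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1)
        return 1;
    /* reserved, but disarmed: no ticking time bomb yet */
    if (watchdog_set_armed(fd, 0) == -1)
        fprintf(stderr, "driver does not support disabling, staying armed\n");
    /* later, once the first client connects: watchdog_set_armed(fd, 1) */
    for (;;)
        pause();
}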

With HA active, well, then there could be something off, either in corosync/knet or
in how we interface with it in pmxcfs; that could well be, but it won't explain the
non-HA issues.

Speaking of pmxcfs, that one also runs with standard priority; we may want to change
it to an RT scheduler too, so that it's ensured it can process all corosync events.

I also have a few other small watchdog-mux patches around: it should nowadays
actually be able to tell us why a reset happened (can also be over/under voltage,
temperature, ...), and I'll retry the keep-alive ioctl a few times if it fails; we
can only win with that after all.
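
For both of those, a rough sketch of what I have in mind, again assuming the
standard Linux watchdog ioctl API (WDIOC_GETBOOTSTATUS and the WDIOF_* status
flags from linux/watchdog.h; whether a given driver reports them varies):

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* log why the last reset happened, if the driver can tell us */
static void log_boot_status(int watchdog_fd)
{
    int status = 0;
    if (ioctl(watchdog_fd, WDIOC_GETBOOTSTATUS, &status) == -1) {
        perror("WDIOC_GETBOOTSTATUS");
        return;
    }
    if (status & WDIOF_CARDRESET)
        fprintf(stderr, "last reboot was caused by the watchdog\n");
    if (status & WDIOF_OVERHEAT)
        fprintf(stderr, "last reboot was caused by overheating\n");
    if (status & WDIOF_POWERUNDER)
        fprintf(stderr, "last reboot was caused by under-voltage\n");
    if (status & WDIOF_POWEROVER)
        fprintf(stderr, "last reboot was caused by over-voltage\n");
}

/* retry the keep-alive ping a few times before treating it as a failure */
static int watchdog_keepalive_retry(int watchdog_fd, int retries)
{
    for (int i = 0; i < retries; i++) {
        if (ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0) == 0)
            return 0;
        usleep(100 * 1000); /* wait 100 ms between attempts */
    }
    return -1;
}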