From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-user-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 18B441FF15C
	for <inbox@lore.proxmox.com>; Wed,  8 Jan 2025 13:57:28 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id B0E3819B91;
	Wed,  8 Jan 2025 13:57:10 +0100 (CET)
Date: Wed, 08 Jan 2025 12:02:14 +0000
To: iztok.gregori@elettra.eu
In-Reply-To: <1be81920-ed5b-4b96-938a-4f35551b9ce5@elettra.eu>
References: <17af1712-1aa7-4f72-bd90-1e45d1361e45@elettra.eu>
 <CAOKSTBvn7mAPJXWJXZ6ZjD4J4+fAGP44rp2hambXGdmGqZ5TVw@mail.gmail.com>
 <CAOKSTBuFw1ihaCA7AF_iDHaSbHJXHREGLVmdPPuFEkR9L3Zjsg@mail.gmail.com>
 <061153a5c032dd89e04d7e3ef54b8fbcdce5fb24.camel@groupe-cyllene.com>
 <1be81920-ed5b-4b96-938a-4f35551b9ce5@elettra.eu>
MIME-Version: 1.0
Message-ID: <mailman.151.1736341029.441.pve-user@lists.proxmox.com>
List-Id: Proxmox VE user list <pve-user.lists.proxmox.com>
List-Post: <mailto:pve-user@lists.proxmox.com>
From: Alwin Antreich via pve-user <pve-user@lists.proxmox.com>
Precedence: list
Cc: Alwin Antreich <alwin@antreich.com>,
 Proxmox VE user list <pve-user@lists.proxmox.com>
X-Mailman-Version: 2.1.29
X-BeenThere: pve-user@lists.proxmox.com
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=subscribe>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-user/>
Reply-To: Proxmox VE user list <pve-user@lists.proxmox.com>
List-Help: <mailto:pve-user-request@lists.proxmox.com?subject=help>
Subject: Re: [PVE-User] Corosync and Cluster reboot
Content-Type: multipart/mixed; boundary="===============3408801441494271379=="
Errors-To: pve-user-bounces@lists.proxmox.com
Sender: "pve-user" <pve-user-bounces@lists.proxmox.com>

--===============3408801441494271379==
Content-Type: message/rfc822
Content-Disposition: inline

Return-Path: <alwin@antreich.com>
X-Original-To: pve-user@lists.proxmox.com
Delivered-To: pve-user@lists.proxmox.com
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by lists.proxmox.com (Postfix) with ESMTPS id 70AE3CB205
	for <pve-user@lists.proxmox.com>; Wed,  8 Jan 2025 13:57:09 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 5170019ADD
	for <pve-user@lists.proxmox.com>; Wed,  8 Jan 2025 13:57:09 +0100 (CET)
Received: from mx.antreich.com (mx.antreich.com [173.249.42.230])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits))
	(No client certificate requested)
	by firstgate.proxmox.com (Proxmox) with ESMTPS
	for <pve-user@lists.proxmox.com>; Wed,  8 Jan 2025 13:57:07 +0100 (CET)
Received: from mail2.antreich.com (unknown [172.16.9.25])
	by mx.antreich.com (Postfix) with ESMTPS id 22ABD6E2E52;
	Wed,  8 Jan 2025 13:02:15 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=antreich.com;
	s=2018; t=1736337735;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=B6TniFmksP0xFkYgeyur76/+gKTUUnAl9A/e3e05FZ4=;
	b=ZXfQP+hf2ocyl8XoerERUF/1HJbWFk32FdvGNdILGiCTs5KUAgoMV7/Wfez0lNPwA9qQVo
	qCFp30clwEfdkuoGX/DhFc6lA5HN+hXUML1AadtZS9KQu0YBLnNSB9YFXp6+u6y57jG4Pc
	IiSeUPlG+YSZmjcLvnVWt+nga7FmL8L8wHKyZhpgeIktODzR8I42HfaBZXxwRdKyKAu+Mm
	I0PU+YnFtYBvRw4AUc2DXSix3QSCXfLbRknJHVkGYZw/i0F0gBt0VGVroJwbt+J7Y0CP8H
	BLO1QmCUN2cCPlTFvRsKFC7k/7BHVQYUrp8A9XCeu7bhEI5Var9XL1+legNlVQ==
MIME-Version: 1.0
Date: Wed, 08 Jan 2025 12:02:14 +0000
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: "Alwin Antreich" <alwin@antreich.com>
Message-ID: <eecf1a85ed7de3658e28e654ea63759fdc08292a@antreich.com>
TLS-Required: No
Subject: Re: [PVE-User] Corosync and Cluster reboot
To: iztok.gregori@elettra.eu
Cc: "Proxmox VE user list" <pve-user@lists.proxmox.com>
In-Reply-To: <1be81920-ed5b-4b96-938a-4f35551b9ce5@elettra.eu>
References: <17af1712-1aa7-4f72-bd90-1e45d1361e45@elettra.eu>
 <CAOKSTBvn7mAPJXWJXZ6ZjD4J4+fAGP44rp2hambXGdmGqZ5TVw@mail.gmail.com>
 <CAOKSTBuFw1ihaCA7AF_iDHaSbHJXHREGLVmdPPuFEkR9L3Zjsg@mail.gmail.com>
 <061153a5c032dd89e04d7e3ef54b8fbcdce5fb24.camel@groupe-cyllene.com>
 <1be81920-ed5b-4b96-938a-4f35551b9ce5@elettra.eu>

Hi Iztok,


January 8, 2025 at 11:12 AM, "Iztok Gregori"
<iztok.gregori@elettra.eu> wrote:


> Hi!
>
> On 07/01/25 15:15, DERUMIER, Alexandre wrote:
>
> > Personally, I'd recommend disabling HA temporarily during the
> > network change (mv /etc/pve/ha/resources.cfg to a tmp directory,
> > stop all pve-ha-lrm, then stop all pve-ha-crm to stop the watchdog).
> >
> > Then, after the migration, check the corosync logs for 1 or 2 days
> > and, if no retransmits occur, re-enable HA.
>
> Good advice. But with the pve-ha-* services down, the "HA-VMs" cannot
> migrate from one node to another, because the migration is handled by
> HA (or at least that is how I remember it happening some time ago). So
> I've (temporarily) removed all the resources (VMs) from HA, which
> tells "pve-ha-lrm" to disable the watchdog ("watchdog closed
> (disabled)"), and no reboot should occur.
Yes. Once no resource is under HA, the LRM becomes idle after a minute
or two and the watchdog is closed.
I second Alexandre's recommendation when working on the corosync
network/config.
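
One way to confirm that every LRM has gone idle before touching the
corosync network is to look at the "lrm" lines of `ha-manager status`.
A minimal Python sketch, assuming those lines have the form
"lrm <node> (<state>, <timestamp>)"; the sample text, the node names
and the helper name non_idle_lrms are made up for illustration:

```python
import re

# Assumed line format: "lrm <node> (<state>, <timestamp>)"
LRM_LINE = re.compile(r"^lrm\s+(\S+)\s+\(([^,)]+)")


def non_idle_lrms(status_text):
    """Return {node: state} for every LRM that is not idle."""
    busy = {}
    for line in status_text.splitlines():
        match = LRM_LINE.match(line.strip())
        if match and match.group(2).strip() != "idle":
            busy[match.group(1)] = match.group(2).strip()
    return busy


# Made-up sample; on a node, feed in the real `ha-manager status` output.
sample = """\
quorum OK
master pve1 (active, Wed Jan  8 12:00:00 2025)
lrm pve1 (idle, Wed Jan  8 12:00:01 2025)
lrm pve2 (active, Wed Jan  8 12:00:02 2025)
"""
print(non_idle_lrms(sample))
```

An empty result would suggest that no LRM is still active, which is the
state described above in which the watchdog is closed.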

> > It's really possible that it's a corosync bug (I remember having
> > had this kind of error with pve 7.X)
>
> I'm leaning towards a similar conclusion, but I still lack an
> understanding of how corosync/watchdog is handled in Proxmox.
>
> For example, I still don't know who is updating the watchdog-mux
> service. Is it corosync (but no "watchdog_device" is set in
> corosync.conf and, per the manual, "if unset, empty or "off", no
> watchdog is used.") or is it pve-ha-lrm?
The watchdog-mux service is handled by the LRM service.
The LRM holds a lock in /etc/pve when it becomes active. This allows
the node to fence itself, since the watchdog is no longer updated when
the node drops out of quorum. By default the softdog is used, but it
can be changed to a hardware watchdog in /etc/default/pve-ha-manager.
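
To see which watchdog a node is configured to use, one can read
/etc/default/pve-ha-manager. A rough Python sketch, assuming the module
is selected with a WATCHDOG_MODULE line as described in the Proxmox HA
documentation; iTCO_wdt is only an example module and watchdog_module
is a hypothetical helper:

```python
def watchdog_module(config_text):
    """Return the configured watchdog module, or 'softdog' if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and anything that is not KEY=VALUE
        key, value = line.split("=", 1)
        cleaned = value.strip().strip("\"'")
        if key.strip() == "WATCHDOG_MODULE" and cleaned:
            return cleaned
    return "softdog"


# Example content; on a node, read /etc/default/pve-ha-manager instead.
example = """\
# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
"""
print(watchdog_module(example))
print(watchdog_module(""))
```

The fallback to "softdog" mirrors the default mentioned above; it is
not read from the system.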

> I think that, after the migration, my best shot is to upgrade the
> cluster, but I have to understand if newer libcephfs client libraries
> support old Ceph clusters.
Ceph usually guarantees compatibility across two-ish major versions
(e.g. Quincy -> Squid, Pacific -> Reef; unless stated otherwise).
Bigger version differences usually work as well, but it is strongly
recommended to upgrade Ceph, as numerous bugs have been fixed over the
past years.
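
As a rough illustration of that rule of thumb, the gap between two
releases can be counted from their major version numbers. This is only
a sketch, not an official compatibility matrix; the max_gap of 2 is one
reading of "two-ish", and the function names are made up:

```python
# Ceph release names and their major version numbers.
CEPH_RELEASES = {
    "octopus": 15,
    "pacific": 16,
    "quincy": 17,
    "reef": 18,
    "squid": 19,
}


def release_gap(older, newer):
    """Number of major releases between two named Ceph releases."""
    return abs(CEPH_RELEASES[newer.lower()] - CEPH_RELEASES[older.lower()])


def within_rule_of_thumb(older, newer, max_gap=2):
    """True if the gap is within the 'two-ish major versions' rule."""
    return release_gap(older, newer) <= max_gap


print(within_rule_of_thumb("Quincy", "Squid"))    # True
print(within_rule_of_thumb("Pacific", "Reef"))    # True
print(within_rule_of_thumb("Octopus", "Squid"))   # False
```

A False result does not mean the combination fails, only that it falls
outside the range where compatibility is usually guaranteed.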

Cheers,
Alwin
--
croit GmbH,
Consulting / Training / 24x7 Support
https://www.croit.io/services/proxmox


--===============3408801441494271379==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

--===============3408801441494271379==--