From: Alwin Antreich
To: iztok.gregori@elettra.eu
Cc: Proxmox VE user list
Date: Wed, 08 Jan 2025 12:02:14 +0000
Subject: Re: [PVE-User] Corosync and Cluster reboot

Hi Iztok,

January 8, 2025 at 11:12 AM, "Iztok Gregori" wrote:
> Hi!
>
> On 07/01/25 15:15, DERUMIER, Alexandre wrote:
>
> > Personally, I'd recommend temporarily disabling HA during the
> > network change (mv /etc/pve/ha/resources.cfg to a tmp directory,
> > stop all pve-ha-lrm, then stop all pve-ha-crm to stop the watchdog).
> >
> > Then, after the migration, check the corosync logs for 1 or 2 days,
> > and after that, if no retransmits occur, re-enable HA.
>
> Good advice. But with the pve-ha-* services down the "HA-VMs" cannot
> migrate from one node to another, because the migration is handled by
> HA (or at least that is how I remember it happening some time ago). So
> I've (temporarily) removed all the resources (VMs) from HA, which has
> the effect of telling "pve-ha-lrm" to disable the watchdog ("watchdog
> closed (disabled)"), so no reboot should occur.

Yes, after a minute or two with no resources under HA, the watchdog is
closed (the LRM becomes idle).
I second Alexandre's recommendation when working on the corosync
network/config.

> > It's really possible that it's a corosync bug (I remember having had
> > this kind of error with PVE 7.x)
>
> I'm leaning towards a similar conclusion, but I still lack an
> understanding of how corosync/watchdog is handled in Proxmox.
>
> For example, I still don't know who is updating the watchdog-mux
> service. Is it corosync (but no "watchdog_device" is set in
> corosync.conf and, per the manual, "if unset, empty or "off", no
> watchdog is used.") or is it pve-ha-lrm?

The watchdog-mux service is handled by the LRM service.
The LRM holds a lock in /etc/pve when it becomes active. This allows the
node to fence itself, since the watchdog is no longer updated when the
node drops out of quorum. By default the softdog is used, but it can be
changed to a hardware watchdog in /etc/default/pve-ha-manager.

> I think that, after the migration, my best shot is to upgrade the
> cluster, but I have to find out whether newer libcephfs client
> libraries support old Ceph clusters.

Ceph usually guarantees compatibility across two-ish major versions
(e.g. Quincy -> Squid, Pacific -> Reef; unless stated otherwise).
Bigger version differences usually work as well, but it is strongly
recommended to upgrade Ceph, as numerous bugs have been fixed over the
past years.

Cheers,
Alwin
-- 
croit GmbH, Consulting / Training / 24x7 Support
https://www.croit.io/services/proxmox
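
PS: As a rough illustration of the "two-ish major versions" rule of thumb
for Ceph mentioned above, here is a minimal Python sketch. The function
names and the MAX_SUPPORTED_GAP constant are hypothetical and not part of
any Ceph or Proxmox tooling; the threshold of 2 is only an assumption
taken from the rule of thumb, and the release notes of the versions
involved remain the authority.

```python
# Ceph release names mapped to their major version numbers.
CEPH_RELEASES = {
    "luminous": 12,
    "mimic": 13,
    "nautilus": 14,
    "octopus": 15,
    "pacific": 16,
    "quincy": 17,
    "reef": 18,
    "squid": 19,
}

# Assumption: the "two-ish major versions" rule of thumb, not a hard guarantee.
MAX_SUPPORTED_GAP = 2


def release_gap(a: str, b: str) -> int:
    """Return how many major releases lie between two Ceph release names."""
    try:
        return abs(CEPH_RELEASES[a.lower()] - CEPH_RELEASES[b.lower()])
    except KeyError as exc:
        raise ValueError(f"unknown Ceph release: {exc.args[0]!r}") from None


def within_rule_of_thumb(client: str, cluster: str) -> bool:
    """True if client and cluster releases are at most MAX_SUPPORTED_GAP apart."""
    return release_gap(client, cluster) <= MAX_SUPPORTED_GAP


if __name__ == "__main__":
    pairs = [("squid", "quincy"), ("reef", "pacific"), ("squid", "nautilus")]
    for client, cluster in pairs:
        verdict = "ok" if within_rule_of_thumb(client, cluster) else "check release notes"
        print(f"{client} client -> {cluster} cluster: "
              f"gap {release_gap(client, cluster)}, {verdict}")
```

A gap larger than two does not mean the combination fails (bigger
differences usually work as well); it only flags the pairs where reading
the release notes and planning a Ceph upgrade is worth the time.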