From: Thomas Lamprecht
To: Stefan Radman, PVE User List
Date: Fri, 10 Dec 2021 16:34:32 +0100
Subject: Re: [PVE-User] watchdog timeout hardcoded to 10 sec
In-Reply-To: <9005E17C-0114-4AD3-9CED-D3615E853F7B@me.com>
List-Id: Proxmox VE user list

Hi,

On 10.12.21 15:22, Stefan Radman wrote:
> What is the reason for hardcoding the watchdog timeout into pve-ha-manager/watchdog-mux.c?

Note that this is the multiplexer; the actual timeout for its clients is 60s.

The MUX opens the actual watchdog. It's a really small C program with a very small footprint and static resource usage, so it won't ever fail to update the watchdog in any situation where the system isn't totally lost.

The MUX then checks the actual clients: if those did not ping in the last 60s, the MUX stops updating the actual watchdog, causing a reset around 0s to 10s later.

So the in-practice timeout for the watchdog services the MUX provides is 60 to 70 seconds, not ten.

>
> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
> 33 int watchdog_timeout = 10;
> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
> 157 if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
>
> I am trying to use a more conservative 5 minute timeout for the IPMI watchdog but it gets changed to 10 seconds when the watchdog-mux.service starts.

That's not a reasonable timeout for Proxmox VE's HA self-fencing, as pmxcfs locks have a timeout of 2 minutes. If you go above that, all consistency guarantees from the self-fencing are void, and an HA service can be recovered while the original one still accesses some of its resources, iow. there be dragons.

ps.
Personally I'd only rely on a HW watchdog if I'm really sure it runs stably. Most of the time their firmware is just a mess, with so many bugs that the kernel's softdog, which is itself a quite small and simple kernel module, works more reliably. YMMV, but I never saw a situation where the softdog didn't do its job, while we did get some reports of failing HW watchdogs; not /that/ many, but most users go for the default setup, so this may be biased.

hope that helps,
Thomas
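
For illustration only, the timing arithmetic from the reply above can be written down as a small C sketch. This is not code from watchdog-mux.c. The constant and function names are made up for this example, and the safety check assumes the worst-case reset delay has to stay within the 2 minute pmxcfs lock timeout, which is a simplified reading of the explanation above.

```c
#include <stdbool.h>

/* Values taken from the mail above; all names are hypothetical. */
enum {
    CLIENT_TIMEOUT_S = 60,  /* clients must ping the MUX within this window */
    HW_TIMEOUT_S     = 10,  /* timeout the MUX sets on the actual watchdog */
    LOCK_TIMEOUT_S   = 120  /* pmxcfs lock timeout (2 minutes) */
};

/* Time from a client's last ping until the node resets, in seconds. */
struct fence_window {
    int earliest_s; /* the MUX stops updating right before the HW timeout fires */
    int latest_s;   /* the MUX stops updating right after its last update */
};

/*
 * The MUX stops updating the actual watchdog once a client has been
 * silent for client_timeout_s; the hardware then resets the node
 * between 0 and hw_timeout_s seconds later.
 */
static struct fence_window compute_fence_window(int client_timeout_s,
                                                int hw_timeout_s)
{
    struct fence_window w;
    w.earliest_s = client_timeout_s;
    w.latest_s = client_timeout_s + hw_timeout_s;
    return w;
}

/*
 * Assumption of this sketch: self-fencing only gives its guarantees if
 * the node is certain to be reset before the lock timeout expires.
 */
static bool fencing_is_safe(int client_timeout_s, int hw_timeout_s,
                            int lock_timeout_s)
{
    struct fence_window w = compute_fence_window(client_timeout_s,
                                                 hw_timeout_s);
    return w.latest_s <= lock_timeout_s;
}
```

With the values from the mail, compute_fence_window(60, 10) gives the 60 to 70 second window, which is within the 120 second lock timeout. Replacing the 10 seconds with a 5 minute hardware timeout (300) pushes the worst case to 360 seconds, so fencing_is_safe() returns false.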