From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <f4d0a324-4d6e-1f9c-17b0-46912f6454dc@proxmox.com>
Date: Fri, 10 Dec 2021 16:34:32 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101
 Thunderbird/96.0
Content-Language: en-US
To: Stefan Radman <stefan.radman@me.com>,
 PVE User List <pve-user@pve.proxmox.com>
References: <9005E17C-0114-4AD3-9CED-D3615E853F7B@me.com>
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
In-Reply-To: <9005E17C-0114-4AD3-9CED-D3615E853F7B@me.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [PVE-User] watchdog timeout hardcoded to 10 sec

Hi,

On 10.12.21 15:22, Stefan Radman wrote:
> What is the reason for hardcoding the watchdog timeout into pve-ha-manager/watchdog-mux.c?

Note that this is the multiplexer; the actual timeout for its clients is 60s.

The MUX opens the actual watchdog. It's a really small C program with a very small
footprint and static resource usage, so it won't ever fail to update the watchdog
in any situation where the system isn't totally lost.

The MUX then checks the actual clients. If those did not ping in the last 60s, the
MUX will stop updating the actual watchdog, causing a reset around 0s to 10s later.

So the in-practice timeout for the watchdog services the MUX provides is 60 to 70
seconds, not ten.
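
To make the arithmetic above concrete, here is a minimal sketch in C of the
two-stage timing. It is not the code from watchdog-mux.c; the names
CLIENT_TIMEOUT, HW_TIMEOUT, mux_should_feed and the two reset-window helpers
are hypothetical, and only the values 60 and 10 come from the explanation above.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the explanation above; all names are illustrative only. */
#define CLIENT_TIMEOUT 60 /* seconds a client may go without pinging the MUX */
#define HW_TIMEOUT     10 /* seconds the actual watchdog waits for an update */

/* The MUX keeps updating the actual watchdog only while the client has
 * pinged within the last CLIENT_TIMEOUT seconds (times in seconds). */
static bool mux_should_feed(long now, long last_client_ping)
{
    assert(now >= last_client_ping);
    return (now - last_client_ping) <= CLIENT_TIMEOUT;
}

/* Earliest reset, counted from the client's last ping: the MUX stops
 * updating just as the actual watchdog was about to expire anyway. */
static long earliest_reset_after_last_ping(void)
{
    return CLIENT_TIMEOUT;
}

/* Latest reset, counted from the client's last ping: the MUX updated the
 * actual watchdog right before it gave up on the client. */
static long latest_reset_after_last_ping(void)
{
    return CLIENT_TIMEOUT + HW_TIMEOUT;
}
```

With these assumed names, a client that last pinged 61 seconds ago makes
mux_should_feed() return false, and the reset then lands somewhere in the
60 to 70 second window after that last ping.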

> 
> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
>   33 int watchdog_timeout = 10;
> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
>  157     if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
> 
> I am trying to use a more conservative 5 minute timeout for the IPMI watchdog but it gets changed to 10 seconds when the watchdog-mux.service starts.

That's not a reasonable timeout for Proxmox VE's HA self-fencing, as pmxcfs locks have
a timeout of 2 minutes. If you go above that, all consistency guarantees from the
self-fencing are void, and an HA service can be recovered while the original one still
accesses some of its resources. IOW, there be dragons.
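
As a rough illustration of that constraint, the check below compares a
worst-case reset delay against the 2 minute lock timeout. This is only a
sketch: PMXCFS_LOCK_TIMEOUT and fencing_timeout_is_safe are hypothetical names
and not part of pve-ha-manager; only the 120 second figure comes from the
paragraph above.

```c
#include <assert.h>
#include <stdbool.h>

/* The 2 minute pmxcfs lock timeout mentioned above; the name is illustrative. */
#define PMXCFS_LOCK_TIMEOUT 120

/* Self-fencing only protects consistency if the node is reset before its
 * locks can time out. The text above only says that going above the lock
 * timeout is unsafe, so values up to and including it are accepted here. */
static bool fencing_timeout_is_safe(long worst_case_reset_seconds)
{
    assert(worst_case_reset_seconds >= 0);
    return worst_case_reset_seconds <= PMXCFS_LOCK_TIMEOUT;
}
```

Under these assumptions, a worst-case reset delay of 70 seconds passes the
check, while a 5 minute (300 second) watchdog timeout does not.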

PS: Personally, I'd only rely on a HW watchdog if I'm really sure it runs stably. Most
of the time their firmware is just a mess, and they have so many bugs that the softdog
of the kernel, which itself is a quite small and simple kernel module, works more
reliably. YMMV, but I never saw a situation where the softdog didn't do its job, while
we did get some reports of failing HW watchdogs. Not /that/ many, but most users go
for the default setup, so this may be biased.

hope that helps,
Thomas