Date: Fri, 17 Apr 2026 10:33:16 +0200
From: Friedrich Weber
To: Michael Köppl, pve-devel@lists.proxmox.com
Subject: Re: [PATCH cluster 1/5] add functions to determine warning level for high token timeouts
References: <20260330144321.321072-1-m.koeppl@proxmox.com> <20260330144321.321072-2-m.koeppl@proxmox.com>
In-Reply-To: <20260330144321.321072-2-m.koeppl@proxmox.com>
List-Id: Proxmox VE development discussion

On 30/03/2026 16:46, Michael Köppl wrote:
> High token timeouts can lead to stability problems in clusters. To
> inform users about the timeout in their current setup (or expected
> timeouts when adding nodes) and give recommendations regarding the token
> coefficient setting, introduce function to calculate the timeout as well
> as determine the warning / recommendation levels.
>
> Signed-off-by: Michael Köppl
> ---
> The timeouts are chosen according to Friedrich's description in [0].
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=7398
>
> src/PVE/Corosync.pm | 50 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/src/PVE/Corosync.pm b/src/PVE/Corosync.pm
> index aef0d31..41d4c6f 100644
> --- a/src/PVE/Corosync.pm
> +++ b/src/PVE/Corosync.pm
> @@ -534,4 +534,54 @@ sub resolve_hostname_like_corosync {
>      return $match_ip_and_version->($resolved_ip);
>  }
>
> +sub calculate_total_timeout {

I think "total timeout" is a little too vague, especially because it's
also user-facing. I don't think totem/corosync have a term for "sum of
token and consensus timeout", and "sum of token and consensus timeout"
is too long. Perhaps something like "recovery timeout" -- though not
perfect because "Recovery" is a specific state in the totem state
machine. Maybe "membership convergence timeout" (though that's a bit
long and obscure)?

> +    my ($totemcfg, $node_count) = @_;
> +
> +    my $token_timeout = $totemcfg->{token} // 3000;
> +    my $token_coefficient = $totemcfg->{token_coefficient} // 650;
> +
> +    my $expected_token_timeout = $token_timeout;
> +    if ($node_count > 2) {
> +        $expected_token_timeout += ($node_count - 2) * $token_coefficient;
> +    }
> +
> +    my $expected_consensus_timeout = $totemcfg->{consensus} // $expected_token_timeout * 1.2;
> +    return ($expected_token_timeout + $expected_consensus_timeout) / 1000.0;
> +}
> +
> +sub get_timeout_warning_level {
> +    my ($total_timeout_secs) = @_;
> +
> +    if ($total_timeout_secs > 50) {
> +        return 'change-strongly-recommended';

I realize I'm the source of these numbers :) But since >50 is actually
pretty bad already, if we phrase it as "strongly recommended" we can
probably go for a slightly lower number:

- > 45: change-strongly-recommended
- > 40: change-recommended
- > 30: optimize

> +    } elsif ($total_timeout_secs > 40) {
> +        return 'change-recommended';
> +    } elsif ($total_timeout_secs > 30) {
> +        return 'optimize';
> +    }
> +
> +    return undef;
> +}
> +
> +sub get_timeout_warning {
> +    my ($total_timeout_secs) = @_;
> +
> +    my $level = get_timeout_warning_level($total_timeout_secs);
> +    return undef if !defined($level);
> +
> +    my $level_msg;
> +    if ($level eq 'change-strongly-recommended') {
> +        $level_msg = "Changing the token coefficient is strongly recommended";
> +    } elsif ($level eq 'change-recommended') {
> +        $level_msg = "Changing the token coefficient is recommended";
> +    } elsif ($level eq 'optimize') {
> +        $level_msg = "Token coefficient can be optimized";
> +    }
> +
> +    return
> +        "Sum of Corosync token and consensus timeout is ${total_timeout_secs}s. "
> +        . "$level_msg. "
> +        . "See https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_changing_the_token_coefficient for details.";
> +}
> +
>  1;
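
To get a feeling for where the thresholds discussed above would kick in, here is a minimal standalone sketch. It copies the formula from the quoted calculate_total_timeout() and searches for the smallest node count that exceeds a given limit. The names total_timeout_secs and first_node_count_exceeding, as well as the search bound of 64 nodes, are hypothetical and only for illustration; they are not part of the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same formula as calculate_total_timeout() in the quoted patch:
# expected token timeout plus consensus timeout, in seconds.
sub total_timeout_secs {
    my ($totemcfg, $node_count) = @_;

    my $token = $totemcfg->{token} // 3000;
    my $coefficient = $totemcfg->{token_coefficient} // 650;

    my $expected_token = $token;
    $expected_token += ($node_count - 2) * $coefficient if $node_count > 2;

    my $expected_consensus = $totemcfg->{consensus} // $expected_token * 1.2;
    return ($expected_token + $expected_consensus) / 1000.0;
}

# Hypothetical helper: smallest node count whose timeout exceeds $limit_secs,
# or undef if none is found up to $max_nodes (an arbitrary search bound).
sub first_node_count_exceeding {
    my ($totemcfg, $limit_secs, $max_nodes) = @_;
    $max_nodes //= 64;

    for my $n (2 .. $max_nodes) {
        return $n if total_timeout_secs($totemcfg, $n) > $limit_secs;
    }
    return undef;
}

# Empty hash: fall back to the defaults used in the patch.
for my $limit (30, 40, 45, 50) {
    my $n = first_node_count_exceeding({}, $limit);
    printf "more than %ds first reached at %s nodes\n", $limit, $n // 'n/a';
}
```

With the defaults from the patch (token 3000, token_coefficient 650, consensus unset), the limits of 30, 40, 45 and 50 seconds are first exceeded at 19, 26, 29 and 33 nodes respectively, so lowering the top threshold from 50 to 45 moves the strongest warning from 33 to 29 nodes.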