From: Fabian Grünbichler
Date: Mon, 18 May 2026 16:11:22 +0200
Subject: Re: [PATCH cluster v3 4/8] add functions to determine warning level for high token timeouts
To: Michael Köppl, pve-devel@lists.proxmox.com
References: <20260427170548.307698-1-m.koeppl@proxmox.com> <20260427170548.307698-5-m.koeppl@proxmox.com>
In-Reply-To: <20260427170548.307698-5-m.koeppl@proxmox.com>
Message-Id: <1779112787.r3e8af5igi.astroid@yuna.none>

On April 27, 2026 7:05 pm, Michael Köppl wrote:
> High token timeouts can lead to stability problems in clusters. To
> inform users about the timeout in their current setup (or expected
> timeouts when adding nodes) and give recommendations regarding the token
> coefficient setting, introduce function to calculate the timeout as well
> as determine the warning / recommendation levels.
>
> Signed-off-by: Michael Köppl
> ---
>  src/PVE/Corosync.pm | 50 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
>
> diff --git a/src/PVE/Corosync.pm b/src/PVE/Corosync.pm
> index aef0d31..45a1f71 100644
> --- a/src/PVE/Corosync.pm
> +++ b/src/PVE/Corosync.pm
> @@ -534,4 +534,54 @@ sub resolve_hostname_like_corosync {
>      return $match_ip_and_version->($resolved_ip);
>  }
>
> +sub calculate_membership_recovery_timeout {
> +    my ($totemcfg, $node_count) = @_;
> +
> +    my $token_timeout = $totemcfg->{token} // 3000;
> +    my $token_coefficient = $totemcfg->{token_coefficient} // 650;
> +
> +    my $expected_token_timeout = $token_timeout;
> +    if ($node_count > 2) {
> +        $expected_token_timeout += ($node_count - 2) * $token_coefficient;
> +    }
> +
> +    my $expected_consensus_timeout = $totemcfg->{consensus} // $expected_token_timeout * 1.2;
> +    return ($expected_token_timeout + $expected_consensus_timeout) / 1000.0;

we could also ask corosync (via corosync-cmapctl) about most of these,
to avoid duplicating the calculations/defaults. the only thing missing
is the coefficient, though we could probably expose that on the corosync
side as well.

> +}
> +
> +sub get_membership_recovery_timeout_warning_level {
> +    my ($total_timeout_secs) = @_;
> +
> +    if ($total_timeout_secs > 45) {
> +        return 'change-strongly-recommended';
> +    } elsif ($total_timeout_secs > 40) {
> +        return 'change-recommended';
> +    } elsif ($total_timeout_secs > 30) {
> +        return 'optimize';
> +    }
> +
> +    return undef;
> +}
> +
> +sub get_membership_recovery_timeout_warning {
> +    my ($total_timeout_secs) = @_;
> +
> +    my $level = get_membership_recovery_timeout_warning_level($total_timeout_secs);
> +    return undef if !defined($level);
> +
> +    my $level_msg;
> +    if ($level eq 'change-strongly-recommended') {
> +        $level_msg = "Lowering the token coefficient is strongly recommended";
> +    } elsif ($level eq 'change-recommended') {
> +        $level_msg = "Lowering the token coefficient is recommended";
> +    } elsif ($level eq 'optimize') {
> +        $level_msg = "The token coefficient can be optimized";
> +    }
> +
> +    return
> +        "Sum of Corosync token and consensus timeout is ${total_timeout_secs}s. "
> +        . "$level_msg. "
> +        . "See 'man pvecm' for details.";

this pretty much duplicates the frontend code - if we leave out the last
line we could just return the warning message, and call the field in the
API return value "totem_warning(s)" or "health_warnings" or just
"warnings" and potentially add more information in the future?

we could still keep the level and return

warnings = [
    level => ...,
    msg => ...,
]

but I don't currently see a reason why we'd benefit from returning raw
values and constructing the warning message on both ends?

> +}
> +
>  1;
> --
> 2.47.3
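
to illustrate the second comment, a rough sketch of how the level and the
message could be returned together. this is only an assumption of how it
might look: the function name `get_membership_recovery_timeout_warnings`
and the list-of-hashes return shape are hypothetical and not part of the
patch. the thresholds and message texts are copied from the quoted code,
with the 'man pvecm' line left out as suggested.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): returns an array reference
# of warning hashes, each carrying both the level and the final message,
# so that callers do not have to rebuild the text themselves.
sub get_membership_recovery_timeout_warnings {
    my ($total_timeout_secs) = @_;

    # [threshold in seconds, level, hint], ordered from most to least
    # severe so that the first matching entry wins.
    my @levels = (
        [45, 'change-strongly-recommended', 'Lowering the token coefficient is strongly recommended'],
        [40, 'change-recommended', 'Lowering the token coefficient is recommended'],
        [30, 'optimize', 'The token coefficient can be optimized'],
    );

    for my $entry (@levels) {
        my ($threshold, $level, $hint) = @$entry;
        next if $total_timeout_secs <= $threshold;

        return [
            {
                level => $level,
                msg => "Sum of Corosync token and consensus timeout is "
                    . "${total_timeout_secs}s. $hint.",
            },
        ];
    }

    return []; # below all thresholds, nothing to warn about
}
```

returning a list (even with a single entry for now) would leave room for
adding other kinds of warnings later without changing the shape of the
API return value.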