From: Michael Köppl
To: Fabian Grünbichler, Michael Köppl
Subject: Re: [PATCH cluster v3 4/8] add functions to determine warning level for high token timeouts
Date: Mon, 18 May 2026 17:39:55 +0200
References: <20260427170548.307698-1-m.koeppl@proxmox.com> <20260427170548.307698-5-m.koeppl@proxmox.com> <1779112787.r3e8af5igi.astroid@yuna.none>
In-Reply-To: <1779112787.r3e8af5igi.astroid@yuna.none>
List-Id: Proxmox VE development discussion

On Mon May 18, 2026 at 4:11 PM CEST, Fabian Grünbichler wrote:
> On April 27, 2026 7:05 pm, Michael Köppl wrote:
>> High token timeouts can lead to stability problems in clusters. To
>> inform users about the timeout in their current setup (or expected
>> timeouts when adding nodes) and give recommendations regarding the token
>> coefficient setting, introduce function to calculate the timeout as well
>> as determine the warning / recommendation levels.
>>
>> Signed-off-by: Michael Köppl
>> ---
>>  src/PVE/Corosync.pm | 50 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/src/PVE/Corosync.pm b/src/PVE/Corosync.pm
>> index aef0d31..45a1f71 100644
>> --- a/src/PVE/Corosync.pm
>> +++ b/src/PVE/Corosync.pm
>> @@ -534,4 +534,54 @@ sub resolve_hostname_like_corosync {
>>      return $match_ip_and_version->($resolved_ip);
>>  }
>>
>> +sub calculate_membership_recovery_timeout {
>> +    my ($totemcfg, $node_count) = @_;
>> +
>> +    my $token_timeout = $totemcfg->{token} // 3000;
>> +    my $token_coefficient = $totemcfg->{token_coefficient} // 650;
>> +
>> +    my $expected_token_timeout = $token_timeout;
>> +    if ($node_count > 2) {
>> +        $expected_token_timeout += ($node_count - 2) * $token_coefficient;
>> +    }
>> +
>> +    my $expected_consensus_timeout = $totemcfg->{consensus} // $expected_token_timeout * 1.2;
>> +    return ($expected_token_timeout + $expected_consensus_timeout) / 1000.0;
>
> we could also ask corosync (via corosync-cmapctl) about most of these,
> to avoid duplicating the calculations/defaults. the only thing missing
> is the coefficient, though we could probably expose that on the corosync
> side as well.

Thanks for having a look at this!

In the original implementation I used the values from cmap directly. I
later switched to the current approach because I wanted to calculate the
timeout for an arbitrary number of nodes (although n and n+1 would
suffice), so that a warning can be displayed before adding another node
if the timeout would rise to a "problematic" level.

I suppose using the values from corosync-cmapctl and then adding
$node_delta * $token_coefficient to the token timeout would work, but
apart from avoiding duplication of the defaults, I'm not sure this would
improve the solution much? Or am I missing something here?
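
To make the two options concrete, here is a minimal sketch rather than the
actual implementation. calculate_membership_recovery_timeout is copied from
the quoted patch so the snippet runs on its own. project_from_runtime_token
is a hypothetical helper (not part of the patch) that starts from a token
timeout already computed by corosync and adds $node_delta * $token_coefficient.
It assumes that no explicit consensus value is configured and that the
cluster already has at least two nodes. How the runtime value and the
coefficient would be obtained from corosync is deliberately left out.

```perl
use strict;
use warnings;

# Copied from the quoted patch so that this sketch is self-contained.
sub calculate_membership_recovery_timeout {
    my ($totemcfg, $node_count) = @_;

    my $token_timeout = $totemcfg->{token} // 3000;
    my $token_coefficient = $totemcfg->{token_coefficient} // 650;

    my $expected_token_timeout = $token_timeout;
    if ($node_count > 2) {
        $expected_token_timeout += ($node_count - 2) * $token_coefficient;
    }

    my $expected_consensus_timeout = $totemcfg->{consensus} // $expected_token_timeout * 1.2;
    return ($expected_token_timeout + $expected_consensus_timeout) / 1000.0;
}

# Hypothetical helper: project the timeout from a token timeout that was
# already computed for the current membership (in milliseconds).
# Assumes consensus is not set explicitly and the cluster has >= 2 nodes.
sub project_from_runtime_token {
    my ($runtime_token_ms, $token_coefficient, $node_delta) = @_;

    my $token = $runtime_token_ms + $node_delta * $token_coefficient;
    my $consensus = $token * 1.2;
    return ($token + $consensus) / 1000.0;
}

# Example: default totem settings, 5 nodes now, 1 node to be added.
my $totemcfg = {};
my $current = calculate_membership_recovery_timeout($totemcfg, 5);
my $after_join = calculate_membership_recovery_timeout($totemcfg, 5 + 1);

# With defaults, the token timeout for 5 nodes is 3000 + 3 * 650 = 4950 ms.
my $projected = project_from_runtime_token(4950, 650, 1);

printf "current: %.2fs, after join: %.2fs, projected: %.2fs\n",
    $current, $after_join, $projected;
```

Under these assumptions both routes yield the same value for n+1, so the
main difference is where the defaults live.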
>
>> +}
>> +
>> +sub get_membership_recovery_timeout_warning_level {
>> +    my ($total_timeout_secs) = @_;
>> +
[snip]
>> +    my $level_msg;
>> +    if ($level eq 'change-strongly-recommended') {
>> +        $level_msg = "Lowering the token coefficient is strongly recommended";
>> +    } elsif ($level eq 'change-recommended') {
>> +        $level_msg = "Lowering the token coefficient is recommended";
>> +    } elsif ($level eq 'optimize') {
>> +        $level_msg = "The token coefficient can be optimized";
>> +    }
>> +
>> +    return
>> +        "Sum of Corosync token and consensus timeout is ${total_timeout_secs}s. "
>> +        . "$level_msg. "
>> +        . "See 'man pvecm' for details.";
>
> this pretty much duplicates the frontend code - if we leave out the last
> line we could just return the warning message, and call the field in the
> API return value "totem_warning(s)" or "health_warnings" or just
> "warnings" and potentially add more information in the future? we could
> still keep the level and return
>
> warnings = [
>     level => ...,
>     msg => ...,
> ]
>
> but I don't currently see a reason why we'd benefit from returning raw
> values and constructing the warning message on both ends?

The messages themselves differ because one warning is about the current
state, whereas the other describes what would happen if another node
were added to the cluster, but I agree that they are unnecessarily
duplicated. We could instead return the warning message as
totem_warnings, as you suggested, but offer different messages depending
on a $node_delta (the number of nodes added relative to the current
state, which will be 1 in practically all cases right now)?

>
>> +}
>> +
>>  1;
>> -- 
>> 2.47.3
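
A rough sketch of what returning totem_warnings with a $node_delta could
look like, offered as an illustration only. build_totem_warnings is a
hypothetical name, and the wording of the "after adding" message is made up
for this example. The level strings and the per-level text are taken from
the quoted patch, while the level itself is passed in as an argument because
the logic that determines it is snipped above. The 'man pvecm' hint is left
out, following the suggestion to drop the last line.

```perl
use strict;
use warnings;

# Per-level text, taken from the quoted patch.
my %LEVEL_MSG = (
    'change-strongly-recommended' => "Lowering the token coefficient is strongly recommended",
    'change-recommended' => "Lowering the token coefficient is recommended",
    'optimize' => "The token coefficient can be optimized",
);

# Hypothetical helper: returns a list of { level, msg } entries.
# $level is assumed to come from the (snipped) warning level logic;
# $node_delta is the number of nodes to be added (0 = current state).
sub build_totem_warnings {
    my ($level, $total_timeout_secs, $node_delta) = @_;
    $node_delta //= 0;

    my $level_msg = defined($level) ? $LEVEL_MSG{$level} : undef;
    return [] if !defined($level_msg);

    # The wording of the $node_delta > 0 variant is an assumption.
    my $prefix = $node_delta > 0
        ? "After adding $node_delta node(s), the sum of Corosync token and consensus timeout would be"
        : "Sum of Corosync token and consensus timeout is";

    return [
        {
            level => $level,
            msg => "$prefix ${total_timeout_secs}s. $level_msg.",
        },
    ];
}

# Example: warning for the state after adding one node.
my $warnings = build_totem_warnings('change-recommended', 12.32, 1);
print "$_->{level}: $_->{msg}\n" for @$warnings;
```

With this shape the frontend would only need to display msg, and level
stays available for styling.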