From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 44DC51FF141
	for <inbox@lore.proxmox.com>; Tue, 19 May 2026 13:40:19 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 0B8A24333;
	Tue, 19 May 2026 13:40:16 +0200 (CEST)
Content-Type: text/plain; charset=UTF-8
Date: Tue, 19 May 2026 13:40:08 +0200
Message-Id: <DIMMI5VBSP9X.3EFRKJWE8109O@proxmox.com>
From: =?utf-8?q?Michael_K=C3=B6ppl?= <m.koeppl@proxmox.com>
To: =?utf-8?q?Fabian_Gr=C3=BCnbichler?= <f.gruenbichler@proxmox.com>,
 =?utf-8?q?Michael_K=C3=B6ppl?= <m.koeppl@proxmox.com>,
 <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH cluster v3 4/8] add functions to determine warning level
 for high token timeouts
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0
X-Mailer: aerc 0.21.0
References: <20260427170548.307698-1-m.koeppl@proxmox.com>
 <20260427170548.307698-5-m.koeppl@proxmox.com>
 <1779112787.r3e8af5igi.astroid@yuna.none>
 <DILWZ7BMBJ9S.30FLMDBI8UAGT@proxmox.com>
 <1779173481.jqlsj6voe5.astroid@yuna.none>
In-Reply-To: <1779173481.jqlsj6voe5.astroid@yuna.none>
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1779190795237
X-SPAM-LEVEL: Spam detection results:  0
	AWL                     0.094 Adjusted score from AWL reputation of From:
 address
	BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
	DMARC_MISSING             0.1 Missing DMARC policy
	KAM_DMARC_STATUS         0.01 Test Rule for DKIM or SPF Failure with Strict
 Alignment
	SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
	SPF_PASS               -0.001 SPF: sender matches SPF record
Message-ID-Hash: YB7IWMZWF5YZQAUMREDQ62P2ALTLVKAM
X-Message-ID-Hash: YB7IWMZWF5YZQAUMREDQ62P2ALTLVKAM
X-MailFrom: m.koeppl@proxmox.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; emergency; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pve-devel-owner@lists.proxmox.com>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Subscribe: <mailto:pve-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pve-devel-leave@lists.proxmox.com>

On Tue May 19, 2026 at 8:59 AM CEST, Fabian Grünbichler wrote:
> On May 18, 2026 5:39 pm, Michael Köppl wrote:
>> On Mon May 18, 2026 at 4:11 PM CEST, Fabian Grünbichler wrote:
>>> On April 27, 2026 7:05 pm, Michael Köppl wrote:
>>>> High token timeouts can lead to stability problems in clusters. To
>>>> inform users about the timeout in their current setup (or expected
>>>> timeouts when adding nodes) and give recommendations regarding the token
>>>> coefficient setting, introduce function to calculate the timeout as well
>>>> as determine the warning / recommendation levels.
>>>>
>>>> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
>>>> ---
>>>>  src/PVE/Corosync.pm | 50 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 50 insertions(+)
>>>>
>>>> diff --git a/src/PVE/Corosync.pm b/src/PVE/Corosync.pm
>>>> index aef0d31..45a1f71 100644
>>>> --- a/src/PVE/Corosync.pm
>>>> +++ b/src/PVE/Corosync.pm
>>>> @@ -534,4 +534,54 @@ sub resolve_hostname_like_corosync {
>>>>      return $match_ip_and_version->($resolved_ip);
>>>>  }
>>>>
>>>> +sub calculate_membership_recovery_timeout {
>>>> +    my ($totemcfg, $node_count) = @_;
>>>> +
>>>> +    my $token_timeout = $totemcfg->{token} // 3000;
>>>> +    my $token_coefficient = $totemcfg->{token_coefficient} // 650;
>>>> +
>>>> +    my $expected_token_timeout = $token_timeout;
>>>> +    if ($node_count > 2) {
>>>> +        $expected_token_timeout += ($node_count - 2) * $token_coefficient;
>>>> +    }
>>>> +
>>>> +    my $expected_consensus_timeout = $totemcfg->{consensus} // $expected_token_timeout * 1.2;
>>>> +    return ($expected_token_timeout + $expected_consensus_timeout) / 1000.0;
>>>
>>> we could also ask corosync (via corosync-cmapctl) about most of these,
>>> to avoid duplicating the calculations/defaults. the only thing missing
>>> is the coefficient, though we could probably expose that on the corosync
>>> side as well.
>>
>> Thanks for having a look at this!
>>
>> In the original implementation I used the values from cmap directly. The
>> reason I decided to implement it like this later on was that I wanted to
>> be able to calculate the timeout for an arbitrary number of nodes
>> (although n and n+1 would suffice) to be able to display a warning
>> before adding another node if the timeout would increase to a
>> "problematic" level. I suppose using the values from corosync-cmapctl
>> and then adding $node_delta * $token_coefficient to the token timeout
>> would work, but apart from avoiding duplication of the defaults, I'm
>> not sure this would improve the solution much? Or am I missing
>> something here?
>
> if corosync ever changes its calculation or defaults, the current
> approach is bad ;)
>
> of course, that also still applies if we get the current value and the
> coefficient from corosync, in case it is the formula that changes..

I agree that reading the values from Corosync directly makes more sense,
since it keeps our implementation from diverging from what Corosync
actually does, at least as long as we only want the current value of the
timeout. Since you suggest below that warning only about the current
value is probably enough, we could do something like:

```perl
sub calculate_membership_recovery_timeout {
    my $cmap = read_cmap();
    return undef if !$cmap;

    my $token = $cmap->{'runtime.config.totem.token'};
    my $consensus = $cmap->{'runtime.config.totem.consensus'};
    return undef if !defined($token) || !defined($consensus);

    return ($token + $consensus) / 1000.0;
}
```

with read_cmap parsing the output of corosync-cmapctl. This could still
be extended to calculate it for an additional node if we wanted to in
the future.
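
To make the idea a bit more concrete, here is a rough, untested-on-a-
cluster sketch of the parsing part. It assumes that corosync-cmapctl
prints one "key (type) = value" entry per line. parse_cmap_output and
timeout_from_cmap are hypothetical names used only for illustration, and
actually running the command is left out on purpose:

```perl
use strict;
use warnings;

# Hypothetical helper: parse already-captured corosync-cmapctl output.
# Assumes one "key (type) = value" entry per line; other lines are skipped.
sub parse_cmap_output {
    my ($output) = @_;

    my $cmap = {};
    for my $line (split(/\n/, $output)) {
        # e.g. "runtime.config.totem.token (u32) = 3000"
        if ($line =~ m/^(\S+)\s+\(\w+\)\s+=\s+(.*)$/) {
            $cmap->{$1} = $2;
        }
    }
    return $cmap;
}

# Hypothetical helper: sum of token and consensus timeout in seconds,
# or undef if either runtime value is missing.
sub timeout_from_cmap {
    my ($cmap) = @_;

    my $token = $cmap->{'runtime.config.totem.token'};
    my $consensus = $cmap->{'runtime.config.totem.consensus'};
    return undef if !defined($token) || !defined($consensus);

    return ($token + $consensus) / 1000.0;
}

# Made-up sample input, only to show how the two helpers fit together.
my $sample = join("\n",
    'runtime.config.totem.token (u32) = 3000',
    'runtime.config.totem.consensus (u32) = 3600',
    'totem.cluster_name (str) = demo',
);
my $timeout = timeout_from_cmap(parse_cmap_output($sample));
printf("%.1fs\n", $timeout) if defined($timeout);
```

Keeping the parsing separate from running the command would also make
it easy to cover with a unit test.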

>
>>>> +}
>>>> +
>>>> +sub get_membership_recovery_timeout_warning_level {
>>>> +    my ($total_timeout_secs) = @_;
>>>> +
>>
>> [snip]
>>
>>>> +    my $level_msg;
>>>> +    if ($level eq 'change-strongly-recommended') {
>>>> +        $level_msg = "Lowering the token coefficient is strongly recommended";
>>>> +    } elsif ($level eq 'change-recommended') {
>>>> +        $level_msg = "Lowering the token coefficient is recommended";
>>>> +    } elsif ($level eq 'optimize') {
>>>> +        $level_msg = "The token coefficient can be optimized";
>>>> +    }
>>>> +
>>>> +    return
>>>> +        "Sum of Corosync token and consensus timeout is ${total_timeout_secs}s. "
>>>> +        . "$level_msg. "
>>>> +        . "See 'man pvecm' for details.";
>>>
>>> this pretty much duplicates the frontend code - if we leave out the last
>>> line we could just return the warning message, and call the field in the
>>> API return value "totem_warning(s)" or "health_warnings" or just
>>> "warnings" and potentially add more information in the future? we could
>>> still keep the level and return
>>>
>>> warnings = [
>>>  level =3D> ...,
>>>  msg =3D> ...,
>>> ]
>>>
>>> but I don't currently see a reason why we'd benefit from returning raw
>>> values and constructing the warning message on both ends?
>>
>> The messages themselves differ because one warning message is for the
>> current state, whereas the other is for what would happen if another
>> node was added to the cluster, but I agree that it's unnecessarily
>> duplicated. We could instead return the warning message as
>> totem_warnings, as you suggested, but offer different warning messages
>> depending on a $node_delta (+ how many nodes to the current state, which
>> will pretty much be 1 for all cases right now)?
>
> yeah, I also wondered whether we should just have a boolean flag to
> determine whether we want the current value or the one for if one node
> were added to the current setup.. but in the end it doesn't make that
> much of a difference, unless the user for some reason set a very large
> coefficient manually?
>
> for small clusters, we should be below the thresholds anyway, and one
> more node doesn't matter. for big clusters, a single node being added
> with the default settings would add 0.65 * 2.2 = 1.43 seconds to the
> total timeout. the gaps between the warning levels are way bigger than
> that, so maybe just checking the current value is enough anyhow?
>
> if the user for some reason has a huge token timeout or coefficient or
> consensus timeout configured manually, they will most likely already be
> in a warning state anyway.. and with the default settings, joining would
> at most bump them from slightly below a warning level into that warning
> level, it's not like we can jump from "everything fine" to "strongly
> recommended" with a single node addition..
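
Just to spell out where the 1.43 seconds come from: a small sketch using
the defaults from the quoted patch (coefficient of 650 ms, consensus
defaulting to 1.2 times the token timeout). per_node_increase_secs is a
hypothetical name, and the result only holds for clusters with more than
two nodes and without an explicitly configured consensus timeout:

```perl
use strict;
use warnings;

# Defaults as used in the quoted patch (coefficient in milliseconds).
my $token_coefficient = 650;
my $consensus_factor = 1.2;

# Each additional node adds one coefficient to the token timeout, and the
# default consensus timeout grows by 1.2 times that amount on top.
sub per_node_increase_secs {
    my ($coefficient_ms) = @_;
    return $coefficient_ms * (1 + $consensus_factor) / 1000.0;
}

printf("%.2f\n", per_node_increase_secs($token_coefficient));
```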

Agreed, I'm fine with adapting it so that there is only a single warning
for the current value. @Friedrich and I initially discussed this
off-list, and that discussion was the primary reason I implemented it
this way, so maybe he has some input here as well. In any case, I'll
prepare a v4 that uses the values from corosync-cmapctl and emits a
single warning. We could then probably omit the warning level as
discussed above and simply return a `totem_warning` string as part of
the response and print that, appending the "See <documentation> for
details" part separately for the web UI and CLI.

>
> it might make more sense to warn about custom values for each of those
> three that are above certain thresholds, in addition to the total
> timeout checks implemented by this series?

[snip]