From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pdm-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 63BC51FF149
	for <inbox@lore.proxmox.com>; Sat, 30 May 2026 01:40:35 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id DEF411B97;
	Sat, 30 May 2026 01:40:33 +0200 (CEST)
Message-ID: <683ab04a-a2e6-4f29-9eb4-21b0f5464879@proxmox.com>
Date: Sat, 30 May 2026 01:40:29 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird Beta
Subject: Re: [PATCH datacenter-manager 2/4] server: remote cache: prepare for
 back-off mechanism
To: Dominik Csapak <d.csapak@proxmox.com>, pdm-devel@lists.proxmox.com
References: <20260529133026.3149896-1-d.csapak@proxmox.com>
 <20260529133026.3149896-3-d.csapak@proxmox.com>
Content-Language: en-US
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
In-Reply-To: <20260529133026.3149896-3-d.csapak@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
List-Id: Proxmox Datacenter Manager development discussion
 <pdm-devel.lists.proxmox.com>
List-Help: <mailto:pdm-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pdm-devel-owner@lists.proxmox.com>
List-Post: <mailto:pdm-devel@lists.proxmox.com>
List-Subscribe: <mailto:pdm-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pdm-devel-leave@lists.proxmox.com>

On 29.05.26 at 15:30, Dominik Csapak wrote:
> this introduces a new field for the RemoteMappingCache that contains the
> current status of a 'BackOffState'. This is intended to mark remotes as
> unreachable when the connection to them fails and only to retry if
> enough time elapsed. This is to prevent sending numerous connections out
> to a remote that is known to not be reachable.
> 
> The back-off timeout is increased exponentially from 10 seconds up to
> 600 seconds, so at most it takes 10 minutes for a remote to be reachable
> again if it was offline for a prolonged period of time.

I'd prefer if we only start this backoff after a remote has already been
failing for a little while (10 min to 1 h) and still cap the recheck
interval at a relatively low period, like 1 min to 5 min. That already
cuts the checks down quite a bit and still keeps it responsive. One idea
might be to also scale this depending on whether other remotes are
configured and online, i.e. if all are offline, poll the first one more
frequently as a heuristic and reset the backoff if that one comes back
online. Or, if only a subset of nodes of a single remote is offline,
give them a higher delay. This fine-tuning could provide quite a bit
better UX, but might need some thought to encapsulate it right, to avoid
having those edge cases leak all over the code that has to handle them.
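
To make the basic suggestion more concrete, here is a minimal sketch of a
policy with a grace period and a low cap. All names (`BackOffPolicy`,
`grace`, `base`, `cap`, `next_delay`) are hypothetical and not taken from
the patch, and the values an instance would use are just picks from the
ranges mentioned above. The cross-remote heuristics are left out on
purpose.

```rust
use std::time::Duration;

/// Hypothetical policy (not from the patch): no back-off during a grace
/// period, then exponential back-off capped at a low maximum.
#[derive(Clone, Copy, Debug)]
pub struct BackOffPolicy {
    /// How long a remote must have been failing before back-off starts.
    pub grace: Duration,
    /// First retry delay once back-off is active.
    pub base: Duration,
    /// Upper bound for the retry delay.
    pub cap: Duration,
}

impl BackOffPolicy {
    /// Returns the delay to wait before the next check, or `None` if the
    /// remote should still be checked at the regular interval.
    ///
    /// `failing_for` is the time since the first consecutive failure,
    /// `failures_since_grace` counts failed checks after the grace period.
    pub fn next_delay(
        &self,
        failing_for: Duration,
        failures_since_grace: u32,
    ) -> Option<Duration> {
        if failing_for < self.grace {
            return None;
        }
        // Double the delay per failure; saturate instead of overflowing.
        let factor = 2u32.checked_pow(failures_since_grace).unwrap_or(u32::MAX);
        let delay = self.base.checked_mul(factor).unwrap_or(self.cap);
        Some(delay.min(self.cap))
    }
}
```

With, for example, `grace` = 10 min, `base` = 10 s and `cap` = 5 min, a
remote that has been failing for one minute is still checked as usual,
while one that has been down for hours is rechecked every 5 minutes at
most. Keeping this in one small type is one way to keep the edge cases
from leaking into the callers.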

What I'm generally missing:
- some more details in the commit message, like the current check
  periods without this patch
- actually logging when we back off, so that this can be traced and the
  behavior understood by an admin (might be here or in 3/4, just
  mentioning it in general).
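
For the logging point, a rough sketch of what could be emitted, assuming
messages are only produced on state transitions so that the log is not
flooded on every skipped check. The `Reachability` enum, the
`transition_message` helper and the message wording are made up for
illustration; the real code would pass the result to whatever logging
facility the server already uses.

```rust
use std::time::Duration;

/// Hypothetical reachability state of a remote, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Reachability {
    Online,
    BackingOff { retry_in: Duration },
}

/// Returns a log line when the state of `remote` changed, `None` otherwise.
pub fn transition_message(
    remote: &str,
    old: Reachability,
    new: Reachability,
) -> Option<String> {
    use Reachability::*;
    match (old, new) {
        // First failure that triggers the back-off.
        (Online, BackingOff { retry_in }) => Some(format!(
            "remote '{remote}': unreachable, backing off, next check in {}s",
            retry_in.as_secs()
        )),
        // Delay changed while the remote is still down.
        (BackingOff { retry_in: a }, BackingOff { retry_in: b }) if a != b => {
            Some(format!(
                "remote '{remote}': still unreachable, next check in {}s",
                b.as_secs()
            ))
        }
        // Remote recovered.
        (BackingOff { .. }, Online) => Some(format!(
            "remote '{remote}': reachable again, back-off reset"
        )),
        // No change, nothing to log.
        _ => None,
    }
}
```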


> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> Note that this now takes up to 10 minutes for pdm to mark a remote as
> reachable again, since it won't retry sooner. We could combat that by
> e.g. retrying every 10th connection, even if the back-off timeout has
> not run out yet. (probably has to be scaled by the nodes and tasks
> we are running?). Another possibility would be to have either a special
> API call to force refresh it, but my guess is that most users would
> just abuse that button?
> 
> I'm very open for other ideas on how to improve this, maybe it's just
> a matter of finetuning the back-off scale and maximum to get a well
> working system.