From: Thomas Lamprecht
To: Dominik Csapak, pdm-devel@lists.proxmox.com
Date: Sat, 30 May 2026 01:40:29 +0200
Subject: Re: [PATCH datacenter-manager 2/4] server: remote cache: prepare for back-off mechanism
Message-ID: <683ab04a-a2e6-4f29-9eb4-21b0f5464879@proxmox.com>
In-Reply-To: <20260529133026.3149896-3-d.csapak@proxmox.com>
References: <20260529133026.3149896-1-d.csapak@proxmox.com> <20260529133026.3149896-3-d.csapak@proxmox.com>
List-Id: Proxmox Datacenter Manager development discussion

On 29.05.26 at 15:30, Dominik Csapak wrote:
> this introduces a new field for the RemoteMappingCache that contains the
> current status of a 'BackOffState'. This is intended to mark remotes as
> unreachable when the connection to them fails and only to retry if
> enough time elapsed. This is to prevent sending numerous connections out
> to a remote that is known to not be reachable.
>
> The back-off timeout is increased exponentially from 10 seconds up to
> 600 seconds, so at most it takes 10 minutes for a remote to be reachable
> again if it was offline for a prolonged period of time.

I'd prefer if we only start this back-off after a remote has already been
failing for a little while (10 min to 1 h), and still cap the recheck
interval at a relatively low value, like 1 min to 5 min. That already cuts
the checks down quite a bit and still keeps it responsive.

One idea might be to also scale this depending on whether other remotes are
configured and online, i.e., if all are offline, poll the first one more
frequently as a heuristic and reset the back-off if it comes back online.
Or, if only a subset of nodes of a single remote is offline, give those
nodes a higher delay.

This fine-tuning could provide quite a bit better UX, but might need some
thought to encapsulate it right, to avoid having the handling of those edge
cases leak all over the code.

What I'm generally missing:
- including some more details, like the current check periods without this,
  in the commit message
- actually logging when we back off, so that this can be traced and the
  behavior understood by an admin (might be here or in 3/4, just
  mentioning it in general).

>
> Signed-off-by: Dominik Csapak
> ---
> Note that this now takes up to 10 minutes for pdm to mark a remote as
> reachable again, since it won't retry sooner. We could combat that by
> e.g. retrying every 10th connection, even if the back-off timeout has
> not run out yet. (probably has to be scaled by the nodes and tasks
> we are running?). Another possibility would be to have either a special
> API call to force refresh it, but my guess is that most users would
> just abuse that button?
>
> I'm very open for other ideas on how to improve this, maybe it's just
> a matter of finetuning the back-off scale and maximum to get a well
> working system.
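
To make the suggestion above a bit more concrete, here is a rough sketch of
a back-off that only kicks in after a grace period and is capped at a low
recheck interval. Everything in it is an assumption for illustration:
`FailureTracker` and its methods are hypothetical names (not the
`BackOffState` from the patch), the constants are just picked from the
ranges mentioned above, time is passed in as plain epoch seconds, and
`eprintln!` stands in for whatever logging facility the server actually uses.

```rust
// Hypothetical values, picked from the ranges discussed above.
pub const GRACE_PERIOD: u64 = 600; // back off only after 10 min of failures
pub const INITIAL_DELAY: u64 = 10; // first back-off delay in seconds
pub const MAX_DELAY: u64 = 300; // never wait longer than 5 min

/// Per-remote failure tracking (illustrative only).
#[derive(Debug, Default)]
pub struct FailureTracker {
    /// Epoch (seconds) of the first failure in the current failure streak.
    first_failure: Option<u64>,
    /// Epoch (seconds) of the most recent failed attempt.
    last_failure: u64,
    /// Current back-off delay in seconds; 0 while inside the grace period.
    delay: u64,
}

impl FailureTracker {
    /// Record a failed connection attempt at time `now`.
    pub fn record_failure(&mut self, now: u64) {
        let first = *self.first_failure.get_or_insert(now);
        self.last_failure = now;

        // Inside the grace period: keep the regular polling behavior.
        if now.saturating_sub(first) < GRACE_PERIOD {
            return;
        }

        // Past the grace period: start at INITIAL_DELAY, then double up to the cap.
        let next = if self.delay == 0 {
            INITIAL_DELAY
        } else {
            self.delay.saturating_mul(2).min(MAX_DELAY)
        };

        // Log only when the delay changes, so an admin can trace the behavior.
        if next != self.delay {
            eprintln!(
                "remote failing for {}s, backing off to {}s between checks",
                now - first,
                next
            );
        }
        self.delay = next;
    }

    /// Record a successful connection; this clears the whole failure streak.
    pub fn record_success(&mut self) {
        *self = Self::default();
    }

    /// Whether a new connection attempt is allowed at time `now`.
    pub fn may_retry(&self, now: u64) -> bool {
        now >= self.last_failure.saturating_add(self.delay)
    }

    /// Current back-off delay in seconds.
    pub fn delay(&self) -> u64 {
        self.delay
    }
}
```

With these values a remote is polled as usual for its first ten minutes of
failures, then rechecked after 10 s, 20 s, 40 s, 80 s, 160 s, and from then
on every 300 s until it succeeds once. Scaling the delay by the state of
other remotes or nodes is deliberately left out of the sketch.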