public inbox for pdm-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Dominik Csapak <d.csapak@proxmox.com>
To: pdm-devel@lists.proxmox.com
Subject: [PATCH datacenter-manager 2/4] server: remote cache: prepare for back-off mechanism
Date: Fri, 29 May 2026 15:30:18 +0200	[thread overview]
Message-ID: <20260529133026.3149896-3-d.csapak@proxmox.com> (raw)
In-Reply-To: <20260529133026.3149896-1-d.csapak@proxmox.com>

this introduces a new field for the RemoteMappingCache that contains the
current status of a 'BackOffState'. This is intended to mark remotes as
unreachable when the connection to them fails and only to retry if
enough time elapsed. This is to prevent sending numerous connections out
to a remote that is known to not be reachable.

The back-off timeout is increased exponentially from 10 seconds up to
600 seconds, so at most it takes 10 minutes for a remote to be reachable
again if it was offline for a prolonged period of time.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
Note that this now takes up to 10 minutes for pdm to mark a remote as
reachable again, since it won't retry sooner. We could combat that by
e.g. retrying every 10th connection, even if the back-off timeout has
not run out yet. (probably has to be scaled by the nodes and tasks
we are running?). Another possibility would be to have either a special
API call to force refresh it, but my guess is that most users would
just abuse that button?

I'm very open for other ideas on how to improve this, maybe it's just
a matter of finetuning the back-off scale and maximum to get a well
working system.

 server/src/remote_cache/mod.rs | 139 +++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

diff --git a/server/src/remote_cache/mod.rs b/server/src/remote_cache/mod.rs
index 39483e9f..d50a324c 100644
--- a/server/src/remote_cache/mod.rs
+++ b/server/src/remote_cache/mod.rs
@@ -24,6 +24,7 @@ use serde::{Deserialize, Serialize};
 
 use proxmox_product_config::replace_config;
 use proxmox_product_config::{ApiLockGuard, open_api_lockfile};
+use proxmox_time::epoch_i64;
 
 use pdm_api_types::remotes::RemoteType;
 use pdm_config::ConfigVersionCache;
@@ -224,6 +225,84 @@ impl RemoteMappingCache {
         self.info_by_hostname(remote, hostname)
             .is_none_or(|info| info.reachable)
     }
+
+    /// Mark a remote as reachable.
+    pub fn mark_remote_reachable(&mut self, remote: &str, reachable: RemoteState) {
+        if let Some(remote) = self.remotes.get_mut(remote) {
+            remote.set_reachable(reachable);
+        }
+    }
+
+    /// Returns the remaining backoff time and optionally the last error we got.
+    pub fn remote_remaining_backoff_time(
+        &self,
+        remote: &str,
+        current_time: i64,
+    ) -> (i64, Option<String>) {
+        match self.remotes.get(remote) {
+            Some(remote) => match &remote.backoff {
+                Some(backoff) => (
+                    backoff.remaining_backoff_time(current_time),
+                    Some(backoff.last_error.clone()),
+                ),
+                None => (0, None),
+            },
+            None => (0, None), // no info about remote, are we allowed to try?
+        }
+    }
+}
+
+/// The initial backoff time for a remote in case it's offline.
+const BACK_OFF_BASE_TIME_S: i64 = 10;
+/// The maximum back-off time between retries.
+const BACK_OFF_MAX_TIME_S: i64 = 600;
+
+/// Holds the current state for backing off an offline remote
+#[derive(Clone, Deserialize, Serialize)]
+struct BackOffState {
+    /// The last time a connection was tried
+    last_try: i64,
+    /// How often the back-off time is doubled since the start, up to a maximum of
+    /// [BACK_OFF_MAX_TIME_S]
+    back_off_doubling_count: u32,
+    /// The last error message we got:
+    last_error: String,
+}
+
+impl BackOffState {
+    /// Creates a new backoff state
+    fn new(time: i64, error: String) -> Self {
+        Self {
+            last_try: time,
+            back_off_doubling_count: 0,
+            last_error: error,
+        }
+    }
+
+    /// Increases the backoff state when the remote is still unreachable.
+    fn increase(&mut self, time: i64, error: String) {
+        if self.remaining_backoff_time(time) < BACK_OFF_MAX_TIME_S {
+            self.back_off_doubling_count += 1;
+        }
+        self.last_try = time;
+        self.last_error = error;
+    }
+
+    /// Returns the remaining time to try again in seconds.
+    /// If negative, we're free to try again.
+    fn remaining_backoff_time(&self, current_time: i64) -> i64 {
+        let back_off_time = (BACK_OFF_BASE_TIME_S * 2i64.pow(self.back_off_doubling_count))
+            .min(BACK_OFF_MAX_TIME_S);
+        self.last_try + back_off_time - current_time
+    }
+}
+
+/// If a remote is reachable or not
+pub enum RemoteState {
+    /// The remote is reachable
+    Reachable,
+    /// The remote is not reachable. Contains the error.
+    Unreachable(String),
 }
 
 /// An entry for a remote in a [`RemoteMappingCache`].
@@ -237,6 +316,9 @@ pub struct RemoteMapping {
 
     /// Maps a node name to a hostname, for where we have that info.
     pub node_to_host: HashMap<String, String>,
+
+    /// internal backoff state, only controlled via [Self::set_reachable]
+    backoff: Option<BackOffState>,
 }
 
 impl RemoteMapping {
@@ -245,6 +327,7 @@ impl RemoteMapping {
             ty,
             hosts: HashMap::new(),
             node_to_host: HashMap::new(),
+            backoff: None,
         }
     }
 
@@ -260,6 +343,22 @@ impl RemoteMapping {
             }
         }
     }
+
+    /// Sets the remote to reachable or unreachable
+    pub fn set_reachable(&mut self, reachable: RemoteState) {
+        match reachable {
+            RemoteState::Reachable => {
+                self.backoff = None;
+            }
+            RemoteState::Unreachable(err) => {
+                let time = epoch_i64();
+                match &mut self.backoff {
+                    Some(backoff) => backoff.increase(time, err),
+                    None => self.backoff = Some(BackOffState::new(time, err)),
+                }
+            }
+        }
+    }
 }
 
 /// All the data we keep cached for nodes found in [`RemoteMapping`].
@@ -289,3 +388,43 @@ impl HostInfo {
         self.node_name.as_deref()
     }
 }
+
+#[cfg(test)]
+mod test {
+    use crate::remote_cache::BackOffState;
+
+    #[test]
+    fn test_back_off_calculation() {
+        let mut backoff = BackOffState::new(0, String::new());
+        // timeout should be @ 10 seconds
+
+        assert_eq!(backoff.remaining_backoff_time(1), 9);
+        assert_eq!(backoff.remaining_backoff_time(5), 5);
+        assert_eq!(backoff.remaining_backoff_time(9), 1);
+        assert_eq!(backoff.remaining_backoff_time(10), 0);
+        assert_eq!(backoff.remaining_backoff_time(15), -5);
+
+        backoff.increase(0, String::new());
+        // timeout should be now @ 2*10 seconds
+
+        assert_eq!(backoff.remaining_backoff_time(10), 10);
+        assert_eq!(backoff.remaining_backoff_time(15), 5);
+        assert_eq!(backoff.remaining_backoff_time(20), 0);
+        assert_eq!(backoff.remaining_backoff_time(30), -10);
+
+        backoff.increase(0, String::new());
+        backoff.increase(0, String::new());
+        backoff.increase(0, String::new());
+
+        // timeout should be now at 2^4 * 10 seconds
+        assert_eq!(backoff.remaining_backoff_time(150), 10);
+
+        backoff.increase(0, String::new());
+        backoff.increase(0, String::new());
+        backoff.increase(0, String::new());
+        backoff.increase(0, String::new());
+
+        // timeout should be now at 2^8 * 10 seconds -> maximum of 600
+        assert_eq!(backoff.remaining_backoff_time(590), 10);
+    }
+}
-- 
2.47.3





  parent reply	other threads:[~2026-05-29 13:31 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-29 13:30 [PATCH datacenter-manager 0/4] implement back-off mechanism for Dominik Csapak
2026-05-29 13:30 ` [RFC PATCH datacenter-manager 1/4] server: connection: multi client: use correct client error for retrying Dominik Csapak
2026-05-29 23:25   ` Thomas Lamprecht
2026-05-29 13:30 ` Dominik Csapak [this message]
2026-05-29 23:40   ` [PATCH datacenter-manager 2/4] server: remote cache: prepare for back-off mechanism Thomas Lamprecht
2026-05-29 13:30 ` [PATCH datacenter-manager 3/4] server: connection: multi-client: use back-off state from remote cache Dominik Csapak
2026-05-29 13:30 ` [PATCH datacenter-manager 4/4] server: pbs client: rework to use the back-off mechanism " Dominik Csapak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260529133026.3149896-3-d.csapak@proxmox.com \
    --to=d.csapak@proxmox.com \
    --cc=pdm-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal