[PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes

all lists on lists.proxmox.com
 help / color / mirror / Atom feed

* [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes
@ 2026-06-08 13:25 Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 1/4] server: remote cache: prepare for back-off mechanism Dominik Csapak
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Dominik Csapak @ 2026-06-08 13:25 UTC (permalink / raw)
  To: pdm-devel

When a remote is not reachable (e.g. network outage, crashes, etc), PDM
tries to connect on every attempt with a timeout. This leads to heavily
delayed api calls in the PDM UI. To counter that, this series implements
a basic back-off mechanism that increases the time between actual api
calls in an exponential way (up to a maximum).

For details on how the back-off mechanism works see patch 1/4

Possible Improvements/Future Work:
* We could expose the back-off values via a config (either global or per
  remote) to give the admin some fine grained control over this behavior
* There is still quite a bit of logs after this, but this can be cleaned
  up/improved upon later too.

changes from v1:
* rebased on master (dropped equivalent patches)
* rework most of the code

Dominik Csapak (4):
  server: remote cache: prepare for back-off mechanism
  server: remote cache: introduce canary remote when none is reachable
  server: connection: multi-client: use back-off state from remote cache
  tasks: remote node mapping: use host cache for PBS too

 .../tasks/remote_node_mapping.rs              |  34 ++-
 server/src/connection.rs                      | 118 +++++++---
 server/src/remote_cache/back_off.rs           | 128 ++++++++++
 server/src/remote_cache/mod.rs                | 222 +++++++++++++++++-
 4 files changed, 439 insertions(+), 63 deletions(-)
 create mode 100644 server/src/remote_cache/back_off.rs

-- 
2.47.3

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH datacenter-manager v2 1/4] server: remote cache: prepare for back-off mechanism
  2026-06-08 13:25 [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes Dominik Csapak
@ 2026-06-08 13:25 ` Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 2/4] server: remote cache: introduce canary remote when none is reachable Dominik Csapak
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2026-06-08 13:25 UTC (permalink / raw)
  To: pdm-devel

Replace the 'reachable' field in `HostInfo` of the `RemoteMappingCache`
with an `Option<BackOffState>` instead. `None` means it's reachable and
some state means it's not.

The back-off state itself contains information about when it was last
reachable and when it's time to try again.

The current logic of the back-off state is:

* Retry for 10 minutes before backing off
* Start with backing off for 10 seconds.
* Retrying after the current back-off time ran out adds a doubled
  back-off time (20s, 40s, etc.) in case the host is still offline.
* do that until a maximum back-off time of 3 minutes is reached.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 .../tasks/remote_node_mapping.rs              |  29 ++--
 server/src/connection.rs                      |  21 ++-
 server/src/remote_cache/back_off.rs           | 128 ++++++++++++++++++
 server/src/remote_cache/mod.rs                | 114 ++++++++++++++--
 4 files changed, 264 insertions(+), 28 deletions(-)
 create mode 100644 server/src/remote_cache/back_off.rs

diff --git a/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs b/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
index 9e2ebf15..867e01cf 100644
--- a/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
+++ b/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
@@ -25,7 +25,7 @@ use proxmox_section_config::typed::SectionConfigData;
 
 use pdm_api_types::remotes::{Remote, RemoteType};
 
-use server::remote_cache::{self, RemoteMappingCache};
+use server::remote_cache::{self, ConnectionState, RemoteMappingCache};
 use server::task_utils;
 
 const CONFIG_POLL_INTERVAL: u64 = 60;
@@ -188,22 +188,21 @@ impl CachingTask {
             log::debug!("querying remote {:?} node {:?}", remote.id, node.hostname);
 
             // if the host is new, we need to query its name
-            let query_result = match query_node_name(remote, &node.hostname).await {
-                Ok(node_name) => Some(node_name),
-                Err(err) => {
-                    log::error!(
-                        "failed to query info for remote '{}' node '{}' - {err:?}",
-                        remote.id,
-                        node.hostname
-                    );
-                    None
-                }
-            };
+            let (query_result, connection_state) =
+                match query_node_name(remote, &node.hostname).await {
+                    Ok(node_name) => (Some(node_name), ConnectionState::Reachable),
+                    Err(err) => {
+                        log::error!(
+                            "failed to query info for remote '{}' node '{}' - {err:?}",
+                            remote.id,
+                            node.hostname
+                        );
+                        (None, ConnectionState::Unreachable(err.to_string()))
+                    }
+                };
 
             let mut cache = RemoteMappingCache::write()?;
-            if let Some(info) = cache.info_by_hostname_mut(&remote.id, &node.hostname) {
-                info.reachable = query_result.is_some();
-            }
+            cache.mark_host_reachable(&remote.id, &node.hostname, connection_state);
             if let Some(node_name) = query_result {
                 cache.set_node_name(&remote.id, &node.hostname, Some(node_name));
             }
diff --git a/server/src/connection.rs b/server/src/connection.rs
index 7ad1a5b9..91a51cbf 100644
--- a/server/src/connection.rs
+++ b/server/src/connection.rs
@@ -25,6 +25,7 @@ use pdm_api_types::remotes::{NodeUrl, Remote, RemoteType, TlsProbeOutcome};
 use pve_api_types::client::PveClientImpl;
 
 use crate::pbs_client::PbsClient;
+use crate::remote_cache::ConnectionState;
 
 static INSTANCE: OnceLock<Box<dyn ClientFactory + Send + Sync>> = OnceLock::new();
 
@@ -381,7 +382,7 @@ impl ClientFactory for DefaultClientFactory {
     ) -> Result<Arc<PveClient>, Error> {
         let cache = crate::remote_cache::RemoteMappingCache::get();
         match cache.info_by_node_name(&remote.id, node) {
-            Some(info) if info.reachable => {
+            Some(info) if info.is_reachable() => {
                 self.make_pve_client_with_endpoint(remote, Some(&info.hostname))
             }
             _ => self.make_pve_client(remote),
@@ -550,7 +551,11 @@ impl MultiClientState {
             let entry = self.get_entry();
             log::error!("marking client {} as unreachable", entry.hostname);
             if let Ok(mut cache) = crate::remote_cache::RemoteMappingCache::write() {
-                cache.mark_host_reachable(&self.remote, &entry.hostname, false);
+                cache.mark_host_reachable(
+                    &self.remote,
+                    &entry.hostname,
+                    ConnectionState::Unreachable("unknown error".to_string()),
+                );
                 let _ = cache.save();
             }
             self.next();
@@ -775,7 +780,11 @@ macro_rules! try_request {
                             log::error!("marking {hostname:?} as reachable again!");
                             if let Ok(mut cache) = crate::remote_cache::RemoteMappingCache::write()
                             {
-                                cache.mark_host_reachable(&$self.remote, &hostname, true);
+                                cache.mark_host_reachable(
+                                    &$self.remote,
+                                    &hostname,
+                                    ConnectionState::Reachable,
+                                );
                                 let _ = cache.save();
                             }
                         }
@@ -797,7 +806,11 @@ macro_rules! try_request {
                 if let Ok(result) = tokio::time::timeout($self.timeout, request).await {
                     if result.is_ok() {
                         if let Ok(mut cache) = crate::remote_cache::RemoteMappingCache::write() {
-                            cache.mark_host_reachable(&$self.remote, &hostname, true);
+                            cache.mark_host_reachable(
+                                &$self.remote,
+                                &hostname,
+                                ConnectionState::Reachable,
+                            );
                             let _ = cache.save();
                         }
                     }
diff --git a/server/src/remote_cache/back_off.rs b/server/src/remote_cache/back_off.rs
new file mode 100644
index 00000000..b2eb7fcd
--- /dev/null
+++ b/server/src/remote_cache/back_off.rs
@@ -0,0 +1,128 @@
+use serde::{Deserialize, Serialize};
+
+/// The initial back-off time when the minimum offline time was reached.
+const BACK_OFF_BASE_TIME_S: i64 = 10;
+/// The maximum back-off time between retries.
+const BACK_OFF_MAX_TIME_S: i64 = 3 * 60;
+/// Minimum time a connection must fail before back-off kicks in.
+const BACK_OFF_MIN_TIME_S: i64 = 10 * 60;
+
+/// Holds the current state for backing off an offline remote
+#[derive(Clone, Deserialize, Serialize)]
+pub struct BackOffState {
+    /// How often the back-off time is doubled since the start, up to a maximum of
+    /// [BACK_OFF_MAX_TIME_S]
+    back_off_doubling_count: u32,
+    /// The last error message we got.
+    last_error: String,
+    /// The first time it failed.
+    first_fail_time: i64,
+    /// The next time a connection can be retried.
+    next_try: i64,
+}
+
+impl BackOffState {
+    /// Create a new back-off config when a host/remote is not reachable for the first time
+    pub fn new(time: i64, error: String) -> Self {
+        Self {
+            first_fail_time: time,
+            next_try: time,
+            back_off_doubling_count: 0,
+            last_error: error,
+        }
+    }
+
+    /// Increases the back-off state when the remote is still unreachable.
+    /// Returns the next timestamp when it's allowed to retry if set.
+    pub fn retried(&mut self, time: i64, error: String) -> Option<i64> {
+        self.last_error = error;
+
+        if self.time_to_next_try(time) > 0 {
+            // still in the back off period
+            return None;
+        }
+
+        if self.back_off_doubling_count == 0 && (time - self.first_fail_time) < BACK_OFF_MIN_TIME_S
+        {
+            // failing, but did not reach the minimum fail timer yet
+            return None;
+        }
+
+        // we're outside the back-off period, and later than the minimum
+        let back_off_time = (BACK_OFF_BASE_TIME_S * 2i64.pow(self.back_off_doubling_count))
+            .min(BACK_OFF_MAX_TIME_S);
+
+        if back_off_time < BACK_OFF_MAX_TIME_S {
+            self.back_off_doubling_count += 1;
+        }
+
+        self.next_try = time + back_off_time;
+        Some(self.next_try)
+    }
+
+    /// Returns the remaining time to try again in seconds.
+    /// If it is 0, we're free to try again.
+    pub fn time_to_next_try(&self, current_time: i64) -> u64 {
+        self.next_try.saturating_sub(current_time).max(0) as u64
+    }
+
+    /// Returns the last error that was set.
+    pub fn last_error(&self) -> String {
+        self.last_error.clone()
+    }
+}
+
+#[cfg(test)]
+mod test {
+    use super::{BACK_OFF_MAX_TIME_S, BACK_OFF_MIN_TIME_S, BackOffState};
+
+    #[test]
+    fn test_back_off_calculation() {
+        let mut time = 0;
+        let mut back_off = BackOffState::new(time, String::new());
+        // timeout should be @ 0 seconds
+
+        assert_eq!(back_off.time_to_next_try(time + 1), 0);
+
+        time += BACK_OFF_MIN_TIME_S;
+        back_off.retried(time, String::new());
+        let mut back_off_time = 10;
+
+        assert_eq!(back_off.time_to_next_try(time + 5), 5);
+        assert_eq!(back_off.time_to_next_try(time + 10), 0);
+        assert_eq!(back_off.time_to_next_try(time + 20), 0);
+
+        back_off.retried(time + back_off_time, String::new());
+        time += back_off_time;
+        back_off_time *= 2;
+        back_off.retried(time, String::new());
+
+        assert_eq!(back_off.time_to_next_try(time), back_off_time as u64);
+
+        time += back_off_time;
+        back_off_time *= 2;
+        back_off.retried(time, String::new());
+        time += back_off_time;
+        back_off_time *= 2;
+        back_off.retried(time, String::new());
+        time += back_off_time;
+        back_off_time *= 2;
+        back_off.retried(time, String::new());
+        time += back_off_time;
+        back_off.retried(time, String::new());
+
+        // retried 7 times, back-off should have reached the max time by now
+        assert_eq!(back_off.time_to_next_try(time), BACK_OFF_MAX_TIME_S as u64);
+
+        // simulate 12 hour outage
+
+        let iterations = 12 * 60 * 60 / BACK_OFF_MAX_TIME_S;
+        for _ in 0..iterations {
+            time += BACK_OFF_MAX_TIME_S;
+            back_off.retried(time, String::new());
+        }
+
+        // timeout should still be at the maximum
+        assert_eq!(back_off.time_to_next_try(time), BACK_OFF_MAX_TIME_S as u64);
+    }
+}
diff --git a/server/src/remote_cache/mod.rs b/server/src/remote_cache/mod.rs
index 39483e9f..b6c2f48a 100644
--- a/server/src/remote_cache/mod.rs
+++ b/server/src/remote_cache/mod.rs
@@ -24,10 +24,14 @@ use serde::{Deserialize, Serialize};
 
 use proxmox_product_config::replace_config;
 use proxmox_product_config::{ApiLockGuard, open_api_lockfile};
+use proxmox_time::epoch_i64;
 
 use pdm_api_types::remotes::RemoteType;
 use pdm_config::ConfigVersionCache;
 
+mod back_off;
+use back_off::BackOffState;
+
 const CACHE_FILENAME: &str = concat!(
     pdm_buildcfg::PDM_CACHE_DIR_M!(),
     "/remote-mapping-cache.json"
@@ -198,16 +202,26 @@ impl RemoteMappingCache {
     }
 
     /// Mark a host as reachable.
-    pub fn mark_host_reachable(&mut self, remote_name: &str, hostname: &str, reachable: bool) {
+    pub fn mark_host_reachable(
+        &mut self,
+        remote_name: &str,
+        hostname: &str,
+        connection_state: ConnectionState,
+    ) {
         if let Some(info) = self.info_by_hostname_mut(remote_name, hostname) {
-            info.reachable = reachable;
+            info.set_reachable(connection_state);
         }
     }
 
     /// Mark a host as reachable.
-    pub fn mark_node_reachable(&mut self, remote_name: &str, node_name: &str, reachable: bool) {
+    pub fn mark_node_reachable(
+        &mut self,
+        remote_name: &str,
+        node_name: &str,
+        connection_state: ConnectionState,
+    ) {
         if let Some(info) = self.info_by_node_name_mut(remote_name, node_name) {
-            info.reachable = reachable;
+            info.set_reachable(connection_state);
         }
     }
 
@@ -222,10 +236,70 @@ impl RemoteMappingCache {
     /// Check if a host is reachable.
     pub fn host_is_reachable(&self, remote: &str, hostname: &str) -> bool {
         self.info_by_hostname(remote, hostname)
-            .is_none_or(|info| info.reachable)
+            .is_none_or(|info| info.is_reachable())
+    }
+
+    /// Get the next time to try the host and the last error if it was not reachable.
+    pub fn host_time_to_next_try(
+        &self,
+        remote: &str,
+        hostname: &str,
+        current_time: i64,
+    ) -> Option<(u64, String)> {
+        self.info_by_hostname(remote, hostname)
+            .and_then(|info| info.back_off.as_ref())
+            .map(|back_off| {
+                (
+                    back_off.time_to_next_try(current_time),
+                    back_off.last_error(),
+                )
+            })
+    }
+
+    /// Returns the remaining back-off time and optionally the last error we got.
+    pub fn remote_time_to_next_try(
+        &self,
+        remote: &str,
+        current_time: i64,
+    ) -> Option<(u64, String)> {
+        match self.remotes.get(remote) {
+            Some(remote) => {
+                let mut time = u64::MAX;
+                let mut err = String::new();
+                for info in remote.hosts.values() {
+                    if let Some(back_off) = &info.back_off {
+                        let node_time = back_off.time_to_next_try(current_time);
+                        // use the least time from the hosts
+                        if node_time < time {
+                            time = node_time;
+                            err = back_off.last_error();
+                        }
+                    } else {
+                        // we found a node that is reachable, return immediately
+                        return None;
+                    }
+                }
+
+                if time == u64::MAX {
+                    // we had no node information so we're allowed to try
+                    return None;
+                }
+
+                Some((time, err))
+            }
+            None => None, // no info about remote, are we allowed to try
+        }
     }
 }
 
+/// If a remote is reachable or not
+pub enum ConnectionState {
+    /// The host/remote/etc. is reachable
+    Reachable,
+    /// The remote/host/etc. is not reachable. Contains the error.
+    Unreachable(String),
+}
+
 /// An entry for a remote in a [`RemoteMappingCache`].
 #[derive(Clone, Deserialize, Serialize)]
 pub struct RemoteMapping {
@@ -271,9 +345,9 @@ pub struct HostInfo {
     /// This is the cluster side node name, if we know it.
     node_name: Option<String>,
 
-    /// This means we were able to reach the node.
-    /// When a client fails to connect it may update this to mark it as unreachable.
-    pub reachable: bool,
+    /// Per host back off config
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    back_off: Option<BackOffState>,
 }
 
 impl HostInfo {
@@ -281,11 +355,33 @@ impl HostInfo {
         Self {
             hostname,
             node_name: None,
-            reachable: true,
+            back_off: None,
         }
     }
 
     pub fn node_name(&self) -> Option<&str> {
         self.node_name.as_deref()
     }
+
+    /// Sets the host's reachable status.
+    /// Returns the next timestamp when it's allowed to retry if set.
+    pub fn set_reachable(&mut self, connection_state: ConnectionState) -> Option<i64> {
+        match connection_state {
+            ConnectionState::Reachable => {
+                self.back_off = None;
+            }
+            ConnectionState::Unreachable(err) => {
+                let time = epoch_i64();
+                match &mut self.back_off {
+                    Some(back_off) => return back_off.retried(time, err),
+                    None => self.back_off = Some(BackOffState::new(time, err)),
+                }
+            }
+        }
+        None
+    }
+
+    pub fn is_reachable(&self) -> bool {
+        self.back_off.is_none()
+    }
 }
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH datacenter-manager v2 2/4] server: remote cache: introduce canary remote when none is reachable
  2026-06-08 13:25 [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 1/4] server: remote cache: prepare for back-off mechanism Dominik Csapak
@ 2026-06-08 13:25 ` Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 3/4] server: connection: multi-client: use back-off state from remote cache Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 4/4] tasks: remote node mapping: use host cache for PBS too Dominik Csapak
  3 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2026-06-08 13:25 UTC (permalink / raw)
  To: pdm-devel

In case no remote is reachable, assume that PDM's network connection is
faulty (e.g. because of hardware failure or wrong configuration). In
that case, it makes no sense having each remote use a long back-off
mechanism until each is online again. Instead, select the last marked
offline remote as canary, which circumvents the back-off mechanism,
while all other remotes continue to use it.

As soon as any host of any remote is marked reachable again, reset all
back-off states for all remotes, so PDM can retry them as soon as it
needs to.

In case there is only a single remote, the back-off mechanism will not
be engaged, since there is no real way to distinguish a failed remote or
non-working PDM network.

A remote is reachable when any of its hosts is reachable.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 server/src/remote_cache/mod.rs | 96 +++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/server/src/remote_cache/mod.rs b/server/src/remote_cache/mod.rs
index b6c2f48a..3263705f 100644
--- a/server/src/remote_cache/mod.rs
+++ b/server/src/remote_cache/mod.rs
@@ -127,6 +127,14 @@ impl WriteRemoteMappingCache {
 pub struct RemoteMappingCache {
     /// This maps a remote name to its mapping.
     pub remotes: HashMap<String, RemoteMapping>,
+
+    /// A remote that is designated canary for which the back-off rules are not applied.
+    /// This is used in case all remotes are marked as offline, so we have a single remote
+    /// that is queried more often than the others.
+    ///
+    /// Used to detect total network failure (and restoration) on the PDM side.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    canary_remote: Option<String>,
 }
 
 impl RemoteMappingCache {
@@ -201,6 +209,25 @@ impl RemoteMappingCache {
         self.remotes.get_mut(remote)?.hosts.get_mut(hostname)
     }
 
+    // checks to see if a canary is needed and sets it,
+    // and checks if we can reset all back-off states
+    fn set_or_reset_canary(&mut self, remote_name: &str, unreachable: bool) {
+        // if all remotes are marked offline, use this last one as canary
+        if unreachable && self.canary_is_needed() {
+            log::debug!("all remotes were marked unreachable, selecting {remote_name} as canary");
+            self.canary_remote = Some(remote_name.to_string());
+        }
+
+        // if we marked a host (and with it a remote) as reachable and we had a canary (meaning
+        // all remotes were offline at the same time) reset the whole back-off state of all remotes
+        if !unreachable && self.canary_remote.is_some() {
+            log::debug!(
+                "{remote_name} became reachable again after all were offline, resetting all back-off states"
+            );
+            self.reset_all_back_off_states();
+        }
+    }
+
     /// Mark a host as reachable.
     pub fn mark_host_reachable(
         &mut self,
@@ -208,9 +235,13 @@ impl RemoteMappingCache {
         hostname: &str,
         connection_state: ConnectionState,
     ) {
+        let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
+
         if let Some(info) = self.info_by_hostname_mut(remote_name, hostname) {
             info.set_reachable(connection_state);
         }
+
+        self.set_or_reset_canary(remote_name, unreachable);
     }
 
     /// Mark a host as reachable.
@@ -220,9 +251,13 @@ impl RemoteMappingCache {
         node_name: &str,
         connection_state: ConnectionState,
     ) {
+        let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
+
         if let Some(info) = self.info_by_node_name_mut(remote_name, node_name) {
             info.set_reachable(connection_state);
         }
+
+        self.set_or_reset_canary(remote_name, unreachable);
     }
 
     /// Update the node name for a host, if the remote and host exist (otherwise this does
@@ -246,6 +281,11 @@ impl RemoteMappingCache {
         hostname: &str,
         current_time: i64,
     ) -> Option<(u64, String)> {
+        if let Some(canary) = &self.canary_remote {
+            if remote == canary {
+                return None;
+            }
+        }
         self.info_by_hostname(remote, hostname)
             .and_then(|info| info.back_off.as_ref())
             .map(|back_off| {
@@ -256,12 +296,48 @@ impl RemoteMappingCache {
             })
     }
 
-    /// Returns the remaining back-off time and optionally the last error we got.
+    // resets the back-off state of all hosts of all remotes. Used when a remote comes online again
+    // when none were reachable before
+    fn reset_all_back_off_states(&mut self) {
+        self.canary_remote = None;
+
+        for remote in self.remotes.values_mut() {
+            remote.reset_back_off();
+        }
+    }
+
+    // checks if a canary is needed: If none is set and all remotes are unreachable
+    fn canary_is_needed(&mut self) -> bool {
+        if let Some(canary) = &self.canary_remote {
+            if self.remotes.contains_key(canary) {
+                return false;
+            }
+
+            // the canary remote vanished from the cache, probably was de-configured
+            self.canary_remote = None;
+        }
+
+        for remote in self.remotes.values() {
+            if remote.is_reachable() {
+                return false;
+            }
+        }
+        true
+    }
+
+    /// Get the next time to try the remote and the last error if it was not reachable.
     pub fn remote_time_to_next_try(
         &self,
         remote: &str,
         current_time: i64,
     ) -> Option<(u64, String)> {
+        // We're the designated canary remote, so pretend we don't have back-off state
+        if let Some(canary) = &self.canary_remote {
+            if canary == remote {
+                return None;
+            }
+        }
+
         match self.remotes.get(remote) {
             Some(remote) => {
                 let mut time = u64::MAX;
@@ -334,6 +410,24 @@ impl RemoteMapping {
             }
         }
     }
+
+    fn is_reachable(&self) -> bool {
+        if self.hosts.is_empty() {
+            return true;
+        }
+        for host in self.hosts.values() {
+            if host.is_reachable() {
+                return true;
+            }
+        }
+        false
+    }
+
+    fn reset_back_off(&mut self) {
+        for host in self.hosts.values_mut() {
+            host.set_reachable(ConnectionState::Reachable);
+        }
+    }
 }
 
 /// All the data we keep cached for nodes found in [`RemoteMapping`].
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH datacenter-manager v2 3/4] server: connection: multi-client: use back-off state from remote cache
  2026-06-08 13:25 [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 1/4] server: remote cache: prepare for back-off mechanism Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 2/4] server: remote cache: introduce canary remote when none is reachable Dominik Csapak
@ 2026-06-08 13:25 ` Dominik Csapak
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 4/4] tasks: remote node mapping: use host cache for PBS too Dominik Csapak
  3 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2026-06-08 13:25 UTC (permalink / raw)
  To: pdm-devel

This greatly reduces the wait time for api calls if a specific remote is
offline for a prolonged period of time. Instead of waiting for a timeout
on every api call to a remote which is known to be offline, return the
last error immediately.

Most notable change here is that the retry loop itself marks the hosts
as unreachable now instead of the iterator. With that, the 'failed()'
method of the `MultiClientState` was unused and can be removed. Adapt
all comments that noted this or mentioned the `failed()` method.

The unreachable clients are still skipped in the iterator.

If a host returns an error on an idempotent API call, we mark the host
as unreachable in case this was a transient error. Any other API call to
that host in the next 10 minutes that succeeds will reset the back-off
state again, so this should be fine.

When backing off, log it with a helpful error message.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 server/src/connection.rs       | 107 ++++++++++++++++++++++-----------
 server/src/remote_cache/mod.rs |  20 +++++-
 2 files changed, 89 insertions(+), 38 deletions(-)

diff --git a/server/src/connection.rs b/server/src/connection.rs
index 91a51cbf..cf0cfb8a 100644
--- a/server/src/connection.rs
+++ b/server/src/connection.rs
@@ -20,6 +20,7 @@ use serde::Serialize;
 
 use proxmox_acme_api::CertificateInfo;
 use proxmox_client::{Client, HttpApiClient, HttpApiResponse, HttpApiResponseStream, TlsOptions};
+use proxmox_time::epoch_i64;
 
 use pdm_api_types::remotes::{NodeUrl, Remote, RemoteType, TlsProbeOutcome};
 use pve_api_types::client::PveClientImpl;
@@ -540,29 +541,6 @@ impl MultiClientState {
         self.current = self.current.wrapping_add(1);
     }
 
-    /// Whenever a request fails with the *current* client we move the current entry forward.
-    ///
-    /// # Note:
-    ///
-    /// With our current strategy `failed_index` is always less than `current`, but if we change
-    /// the strategy, we may want to change this to something like `1 + max(current, failed)`.
-    fn failed(&mut self, failed_index: usize) {
-        if self.current == failed_index {
-            let entry = self.get_entry();
-            log::error!("marking client {} as unreachable", entry.hostname);
-            if let Ok(mut cache) = crate::remote_cache::RemoteMappingCache::write() {
-                cache.mark_host_reachable(
-                    &self.remote,
-                    &entry.hostname,
-                    ConnectionState::Unreachable("unknown error".to_string()),
-                );
-                let _ = cache.save();
-            }
-            self.next();
-            self.skip_unreachable();
-        }
-    }
-
     /// Skip ahead as long as we're pointing to an unreachable.
     fn skip_unreachable(&mut self) {
         let cache = crate::remote_cache::RemoteMappingCache::get();
@@ -588,8 +566,7 @@ impl MultiClientState {
         &self.entries[self.index()]
     }
 
-    /// Get the current entry and its index which can be passed to `failed()` if the client fails
-    /// to connect.
+    /// Get the current entry and its index.
     fn get(&self) -> (&MultiClientEntry, usize) {
         let index = self.index();
         (&self.entries[index], self.current)
@@ -656,8 +633,8 @@ impl MultiClient {
     ///
     /// This is basically a "generator" for clients to try.
     ///
-    /// We share the "state" with other tasks. When a client fails, it is "marked" as failed and
-    /// the state "rotates" through the clients.
+    /// We share the "state" with other tasks. When a client fails, it should be "marked" as failed
+    /// so that the state "rotates" through the clients.
     /// We might be skipping clients if other tasks already tried "more" clients, but that's fine,
     /// since there's no point in trying the same remote twice simultaneously if it is currently
     /// offline...
@@ -688,10 +665,14 @@ impl MultiClient {
                     Some(TryClient::reachable(client))
                 }
                 Some((start, current)) => {
-                    // If our last request failed, the retry-loop asks for another client, mark the
-                    // one we just used as failed and check if all clients have gone through a
-                    // retry loop...
-                    state.failed(current);
+                    // If our last request failed, the retry-loop asks for another client, it marked
+                    // the one we just used as failed and we check if all clients have gone through
+                    // a retry loop...
+                    if state.current == current {
+                        state.next();
+                        state.skip_unreachable();
+                    }
+
                     if state.tried_all_since(start) {
                         // This iterator (and therefore this retry-loop) has tried all clients.
                         // Give up.
@@ -701,8 +682,7 @@ impl MultiClient {
                             state.get_at(try_unreachable.as_mut()?.next()?),
                         ));
                     }
-                    // finally just get the new current client and update `current` for the later
-                    // call to `failed()`
+                    // finally just get the new current client and update `current`
                     let (client, new_current) = state.get();
                     start_current = Some((start, new_current));
 
@@ -729,6 +709,23 @@ macro_rules! try_request {
     ($self:expr, $method:expr, $path_and_query:expr, $params:expr, $how:ident) => {
         let params = $params.map(serde_json::to_value);
         Box::pin(async move {
+            // check if we're still in the back-off time frame of the remote
+            {
+                let cache = crate::remote_cache::RemoteMappingCache::get();
+                if let Some((back_off_time, last_error)) =
+                    cache.remote_time_to_next_try(&$self.remote, epoch_i64())
+                {
+                    if back_off_time > 0 {
+                        log::debug!(
+                            "wanted to retry '{}' but still {}s until retry time",
+                            $self.remote,
+                            back_off_time
+                        );
+                        return Err(proxmox_client::Error::Connect(last_error.into()));
+                    }
+                }
+            }
+
             let params = params
                 .transpose()
                 .map_err(|err| proxmox_client::Error::Anyhow(err.into()))?;
@@ -738,14 +735,31 @@ macro_rules! try_request {
             // remember the first endpoint to retry once at the end if it only failed to connect
             let mut connect_retry = None;
             let mut first = true;
-            // The iterator in use here will automatically mark a client as faulty if we move on to
-            // the `next()` one.
+            // If a call fails here, we have to mark the host as faulty.
             for TryClient {
                 client,
                 hostname,
                 reachable,
             } in $self.try_clients()
             {
+                // check if we're still in the back-off time frame of the host
+                {
+                    let cache = crate::remote_cache::RemoteMappingCache::get();
+                    if let Some((back_off_time, _)) =
+                        cache.host_time_to_next_try(&$self.remote, &hostname, epoch_i64())
+                    {
+                        if back_off_time > 0 {
+                            log::debug!(
+                                "wanted to retry '{}' - '{}', but still {}s until retry time",
+                                $self.remote,
+                                hostname,
+                                back_off_time
+                            );
+                            continue;
+                        }
+                    }
+                }
+
                 if let Some(err) = last_err.take() {
                     let path = $path_and_query;
                     log::error!("client error on request {path}, trying another remote - {err:?}");
@@ -795,6 +809,29 @@ macro_rules! try_request {
                     }
                 }
                 first = false;
+
+                // if this client was not able to reach the remote, mark the current hostname as
+                // unreachable
+                let err = match (timed_out, last_err.as_ref()) {
+                    (true, _) => Some("request timed out".to_string()),
+                    (false, Some(err)) => Some(err.to_string()),
+                    (false, None) => None,
+                };
+                if let Some(err) = err {
+                    if let Ok(mut cache) = crate::remote_cache::RemoteMappingCache::write() {
+                        log::debug!(
+                            "could not reach host '{}' on remote '{}', marking as unreachable",
+                            hostname,
+                            $self.remote
+                        );
+                        cache.mark_host_reachable(
+                            &$self.remote,
+                            &hostname,
+                            ConnectionState::Unreachable(err),
+                        );
+                        let _ = cache.save();
+                    }
+                }
             }
 
             // best-effort: give the first endpoint one more try after all others, but only on a
diff --git a/server/src/remote_cache/mod.rs b/server/src/remote_cache/mod.rs
index 3263705f..270118a9 100644
--- a/server/src/remote_cache/mod.rs
+++ b/server/src/remote_cache/mod.rs
@@ -24,7 +24,7 @@ use serde::{Deserialize, Serialize};
 
 use proxmox_product_config::replace_config;
 use proxmox_product_config::{ApiLockGuard, open_api_lockfile};
-use proxmox_time::epoch_i64;
+use proxmox_time::{epoch_i64, epoch_to_rfc2822};
 
 use pdm_api_types::remotes::RemoteType;
 use pdm_config::ConfigVersionCache;
@@ -238,7 +238,14 @@ impl RemoteMappingCache {
         let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
 
         if let Some(info) = self.info_by_hostname_mut(remote_name, hostname) {
-            info.set_reachable(connection_state);
+            if let Some(next_try) = info.set_reachable(connection_state) {
+                log::warn!(
+                    "could not reach host {} of {} for some time, backing off until {}",
+                    hostname,
+                    remote_name,
+                    epoch_to_rfc2822(next_try).unwrap_or_else(|_| next_try.to_string())
+                );
+            }
         }
 
         self.set_or_reset_canary(remote_name, unreachable);
@@ -254,7 +261,14 @@ impl RemoteMappingCache {
         let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
 
         if let Some(info) = self.info_by_node_name_mut(remote_name, node_name) {
-            info.set_reachable(connection_state);
+            if let Some(next_try) = info.set_reachable(connection_state) {
+                log::warn!(
+                    "could not reach node {} of {} for some time, backing off until {}",
+                    node_name,
+                    remote_name,
+                    epoch_to_rfc2822(next_try).unwrap_or_else(|_| next_try.to_string())
+                );
+            }
         }
 
         self.set_or_reset_canary(remote_name, unreachable);
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH datacenter-manager v2 4/4] tasks: remote node mapping: use host cache for PBS too
  2026-06-08 13:25 [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes Dominik Csapak
                   ` (2 preceding siblings ...)
  2026-06-08 13:25 ` [PATCH datacenter-manager v2 3/4] server: connection: multi-client: use back-off state from remote cache Dominik Csapak
@ 2026-06-08 13:25 ` Dominik Csapak
  3 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2026-06-08 13:25 UTC (permalink / raw)
  To: pdm-devel

since we want to leverage the back-off mechanism from the remote node
cache, we have to insert host information there for PBS too.

Omit the nodename querying for now, as there is no need to do that on
PBS remotes (there is only ever one node in a PBS remote).

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 .../bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs  | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs b/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
index 867e01cf..6f4919a4 100644
--- a/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
+++ b/server/src/bin/proxmox-datacenter-api/tasks/remote_node_mapping.rs
@@ -140,11 +140,6 @@ impl CachingTask {
             *entry = remote_cache::RemoteMapping::new(remote.ty);
         }
 
-        // Only PVE entries currently have a node cache, so skip non-PVE remotes:
-        if remote.ty != RemoteType::Pve {
-            return;
-        }
-
         // prune nodes which were removed:
         entry.hosts.retain(|hostname, info| {
             let retain = remote.nodes.iter().any(|node| node.hostname == *hostname);
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-08 13:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-08 13:25 [PATCH datacenter-manager v2 0/4] implement back-off mechanism for connection errors for remotes Dominik Csapak
2026-06-08 13:25 ` [PATCH datacenter-manager v2 1/4] server: remote cache: prepare for back-off mechanism Dominik Csapak
2026-06-08 13:25 ` [PATCH datacenter-manager v2 2/4] server: remote cache: introduce canary remote when none is reachable Dominik Csapak
2026-06-08 13:25 ` [PATCH datacenter-manager v2 3/4] server: connection: multi-client: use back-off state from remote cache Dominik Csapak
2026-06-08 13:25 ` [PATCH datacenter-manager v2 4/4] tasks: remote node mapping: use host cache for PBS too Dominik Csapak

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal