From: Dominik Csapak <d.csapak@proxmox.com>
To: pdm-devel@lists.proxmox.com
Subject: [PATCH datacenter-manager v2 2/4] server: remote cache: introduce canary remote when none is reachable
Date: Mon, 8 Jun 2026 15:25:30 +0200
Message-ID: <20260608132539.2949407-3-d.csapak@proxmox.com>
In-Reply-To: <20260608132539.2949407-1-d.csapak@proxmox.com>

In case no remote is reachable, assume that PDM's network connection is
faulty (e.g. because of a hardware failure or a wrong configuration). In
that case, it makes no sense to have each remote go through a long
back-off until it is online again. Instead, select the remote that was
marked offline last as canary. The canary circumvents the back-off
mechanism, while all other remotes continue to use it.

As soon as any host of any remote is marked reachable again, reset the
back-off state of all remotes, so PDM can retry them as soon as it needs
to.

In case there is only a single remote, the back-off mechanism will not
be engaged, since there is no real way to distinguish a failed remote
from a non-working PDM network.

A remote counts as reachable when any of its hosts is reachable.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
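Illustration only, not part of the commit: a minimal, self-contained
sketch of the canary behaviour described in the commit message, under a
heavily simplified model. `Remote`, `Cache`, `mark`, `time_to_next_try`
and the fixed 60 second back-off are hypothetical stand-ins, not the
types or values used in server/src/remote_cache/mod.rs, and the hosts of
a remote are collapsed into a single flag.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified per-remote state (the real cache tracks hosts).
#[derive(Default)]
struct Remote {
    reachable: bool,
    /// Assumed back-off marker: seconds until the next attempt, if any.
    back_off: Option<u64>,
}

#[derive(Default)]
struct Cache {
    remotes: HashMap<String, Remote>,
    canary: Option<String>,
}

impl Cache {
    /// A canary is needed if no valid one is set and every remote is unreachable.
    fn canary_is_needed(&mut self) -> bool {
        if let Some(canary) = &self.canary {
            if self.remotes.contains_key(canary) {
                return false;
            }
            // the canary is no longer known, forget it
            self.canary = None;
        }
        self.remotes.values().all(|remote| !remote.reachable)
    }

    /// Record a connection result for `name`, then select or reset the canary.
    fn mark(&mut self, name: &str, reachable: bool) {
        if let Some(remote) = self.remotes.get_mut(name) {
            remote.reachable = reachable;
            // assumption: a fixed 60 second back-off on every failure
            remote.back_off = if reachable { None } else { Some(60) };
        }

        if !reachable && self.canary_is_needed() {
            // the remote that went offline last becomes the canary
            self.canary = Some(name.to_string());
        }

        if reachable && self.canary.is_some() {
            // the network is back: drop the canary and all back-off state
            self.canary = None;
            for remote in self.remotes.values_mut() {
                remote.reachable = true;
                remote.back_off = None;
            }
        }
    }

    /// The canary never reports a back-off, all other remotes report theirs.
    fn time_to_next_try(&self, name: &str) -> Option<u64> {
        if self.canary.as_deref() == Some(name) {
            return None;
        }
        self.remotes.get(name)?.back_off
    }
}
```

With two remotes `a` and `b`, marking `a` and then `b` unreachable makes
`b` the canary, so `time_to_next_try("b")` returns `None` while `a`
keeps its back-off. Marking `b` reachable afterwards clears the canary
and the back-off of `a` as well.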
server/src/remote_cache/mod.rs | 96 +++++++++++++++++++++++++++++++++-
1 file changed, 95 insertions(+), 1 deletion(-)
diff --git a/server/src/remote_cache/mod.rs b/server/src/remote_cache/mod.rs
index b6c2f48a..3263705f 100644
--- a/server/src/remote_cache/mod.rs
+++ b/server/src/remote_cache/mod.rs
@@ -127,6 +127,14 @@ impl WriteRemoteMappingCache {
 pub struct RemoteMappingCache {
     /// This maps a remote name to its mapping.
     pub remotes: HashMap<String, RemoteMapping>,
+
+    /// A remote that is designated canary for which the back-off rules are not applied.
+    /// This is used in case all remotes are marked as offline, so we have a single remote
+    /// that is queried more often than the others.
+    ///
+    /// Used to detect total network failure (and restoration) on the PDM side.
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    canary_remote: Option<String>,
 }

 impl RemoteMappingCache {
@@ -201,6 +209,25 @@ impl RemoteMappingCache {
         self.remotes.get_mut(remote)?.hosts.get_mut(hostname)
     }

+    // checks to see if a canary is needed and sets it,
+    // and checks if we can reset all back-off states
+    fn set_or_reset_canary(&mut self, remote_name: &str, unreachable: bool) {
+        // if all remotes are marked offline, use this last one as canary
+        if unreachable && self.canary_is_needed() {
+            log::debug!("all remotes were marked unreachable, selecting {remote_name} as canary");
+            self.canary_remote = Some(remote_name.to_string());
+        }
+
+        // if we marked a host (and with it a remote) as reachable and we had a canary (meaning
+        // all remotes were offline at the same time) reset the whole back-off state of all remotes
+        if !unreachable && self.canary_remote.is_some() {
+            log::debug!(
+                "{remote_name} became reachable again after all were offline, resetting all back-off states"
+            );
+            self.reset_all_back_off_states();
+        }
+    }
+
     /// Mark a host as reachable.
     pub fn mark_host_reachable(
         &mut self,
@@ -208,9 +235,13 @@ impl RemoteMappingCache {
         hostname: &str,
         connection_state: ConnectionState,
     ) {
+        let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
+
         if let Some(info) = self.info_by_hostname_mut(remote_name, hostname) {
             info.set_reachable(connection_state);
         }
+
+        self.set_or_reset_canary(remote_name, unreachable);
     }

     /// Mark a host as reachable.
@@ -220,9 +251,13 @@ impl RemoteMappingCache {
         node_name: &str,
         connection_state: ConnectionState,
     ) {
+        let unreachable = matches!(&connection_state, ConnectionState::Unreachable(_));
+
         if let Some(info) = self.info_by_node_name_mut(remote_name, node_name) {
             info.set_reachable(connection_state);
         }
+
+        self.set_or_reset_canary(remote_name, unreachable);
     }

     /// Update the node name for a host, if the remote and host exist (otherwise this does
@@ -246,6 +281,11 @@ impl RemoteMappingCache {
         hostname: &str,
         current_time: i64,
     ) -> Option<(u64, String)> {
+        if let Some(canary) = &self.canary_remote {
+            if remote == canary {
+                return None;
+            }
+        }
         self.info_by_hostname(remote, hostname)
             .and_then(|info| info.back_off.as_ref())
             .map(|back_off| {
@@ -256,12 +296,48 @@ impl RemoteMappingCache {
             })
     }

-    /// Returns the remaining back-off time and optionally the last error we got.
+    // resets the back-off state of all hosts of all remotes. Used when a remote comes online again
+    // when none were reachable before
+    fn reset_all_back_off_states(&mut self) {
+        self.canary_remote = None;
+
+        for remote in self.remotes.values_mut() {
+            remote.reset_back_off();
+        }
+    }
+
+    // checks if a canary is needed: If none is set and all remotes are unreachable
+    fn canary_is_needed(&mut self) -> bool {
+        if let Some(canary) = &self.canary_remote {
+            if self.remotes.contains_key(canary) {
+                return false;
+            }
+
+            // the canary remote vanished from the cache, probably was de-configured
+            self.canary_remote = None;
+        }
+
+        for remote in self.remotes.values() {
+            if remote.is_reachable() {
+                return false;
+            }
+        }
+        true
+    }
+
+    /// Get the next time to try the remote and the last error if it was not reachable.
     pub fn remote_time_to_next_try(
         &self,
         remote: &str,
         current_time: i64,
     ) -> Option<(u64, String)> {
+        // We're the designated canary remote, so pretend we don't have back-off state
+        if let Some(canary) = &self.canary_remote {
+            if canary == remote {
+                return None;
+            }
+        }
+
         match self.remotes.get(remote) {
             Some(remote) => {
                 let mut time = u64::MAX;
@@ -334,6 +410,24 @@ impl RemoteMapping {
             }
         }
     }
+
+    fn is_reachable(&self) -> bool {
+        if self.hosts.is_empty() {
+            return true;
+        }
+        for host in self.hosts.values() {
+            if host.is_reachable() {
+                return true;
+            }
+        }
+        false
+    }
+
+    fn reset_back_off(&mut self) {
+        for host in self.hosts.values_mut() {
+            host.set_reachable(ConnectionState::Reachable);
+        }
+    }
 }

 /// All the data we keep cached for nodes found in [`RemoteMapping`].
--
2.47.3