public inbox for pbs-devel@lists.proxmox.com
* [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs
@ 2026-04-17  9:26 Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox v6 01/15] pbs api types: add `worker-threads` to sync job config Christian Ebner
                   ` (16 more replies)
  0 siblings, 17 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Syncing contents from/to a remote source via a sync job suffers from
low throughput on high-latency networks because of limitations of the
HTTP/2 connection, as described in [0]. To improve this, it has been
suggested to sync multiple groups in parallel by establishing multiple
reader instances.

This patch series implements that functionality by adding the sync job
configuration property `worker-threads`, which defines the number of
group pull/push tokio tasks executed in parallel on the runtime during
each job.

Example configuration:
```
sync: s-8764c440-3a6c
	ns
	owner root@pam
	remote local
	remote-ns
	remote-store push-target-store
	remove-vanished false
	store datastore
	sync-direction push
	worker-threads 4
```

Since log messages are now also written concurrently, prefix log lines
related to groups, snapshots and archives with their respective
context and add context to error messages.

To reduce interwoven log messages from log lines arriving in fast
succession from different group workers, implement buffering logic
that keeps up to 5 lines per worker buffered, with a flush timeout of
1 second. This makes the log easier to follow.

Further, improve logging especially for sync jobs in push direction,
which only displayed limited information so far.

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=4182

Changes since version 5 (thanks @Fabian):
- Implement buffered logger for better grouping of log lines arriving in fast succession
- Refactor group worker into standalone BoundedJoinSet implementation
- Improve log output by using better prefixes
- Add missing error contexts

Changes since version 4:
- Use dedicated tokio tasks to run in parallel on different runtime threads,
  not just multiple concurrent futures on the same thread.
- Rework store progress accounting logic to avoid mutex locks when possible,
  use atomic counters instead.
- Expose setting also in the sync job edit window, not just the config.


proxmox:

Christian Ebner (1):
  pbs api types: add `worker-threads` to sync job config

 pbs-api-types/src/jobs.rs | 11 +++++++++++
 1 file changed, 11 insertions(+)


proxmox-backup:

Christian Ebner (14):
  tools: group and sort module imports
  tools: implement buffered logger for concurrent log messages
  tools: add bounded join set to run concurrent tasks bound by limit
  client: backup writer: fix upload stats size and rate for push sync
  api: config/sync: add optional `worker-threads` property
  sync: pull: revert avoiding reinstantiation for encountered chunks map
  sync: pull: factor out backup group locking and owner check
  sync: pull: prepare pull parameters to be shared across parallel tasks
  fix #4182: server: sync: allow pulling backup groups in parallel
  server: pull: prefix log messages and add error context
  sync: push: prepare push parameters to be shared across parallel tasks
  server: sync: allow pushing groups concurrently
  server: push: prefix log messages and add additional logging
  ui: expose group worker setting in sync job edit window

 pbs-client/src/backup_stats.rs    |  20 +-
 pbs-client/src/backup_writer.rs   |   4 +-
 pbs-tools/Cargo.toml              |   2 +
 pbs-tools/src/bounded_join_set.rs |  69 +++++
 pbs-tools/src/buffered_logger.rs  | 216 ++++++++++++++
 pbs-tools/src/lib.rs              |   5 +-
 src/api2/config/sync.rs           |  10 +
 src/api2/pull.rs                  |   9 +-
 src/api2/push.rs                  |   8 +-
 src/server/pull.rs                | 453 +++++++++++++++++++++---------
 src/server/push.rs                | 349 ++++++++++++++++++-----
 src/server/sync.rs                |  40 ++-
 www/window/SyncJobEdit.js         |  11 +
 13 files changed, 978 insertions(+), 218 deletions(-)
 create mode 100644 pbs-tools/src/bounded_join_set.rs
 create mode 100644 pbs-tools/src/buffered_logger.rs


Summary over all repositories:
  14 files changed, 989 insertions(+), 218 deletions(-)

-- 
Generated by murpp 0.11.0




^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox v6 01/15] pbs api types: add `worker-threads` to sync job config
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 02/15] tools: group and sort module imports Christian Ebner
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Allow specifying the number of concurrent worker threads used to sync
groups for sync jobs. Values range from the current default of 1 up
to 32, although performance gains saturate as the thread count grows.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 pbs-api-types/src/jobs.rs | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
index 7e6dfb94..c4e6dda6 100644
--- a/pbs-api-types/src/jobs.rs
+++ b/pbs-api-types/src/jobs.rs
@@ -88,6 +88,11 @@ pub const VERIFY_JOB_VERIFY_THREADS_SCHEMA: Schema = threads_schema(
     4,
 );
 
+pub const SYNC_WORKER_THREADS_SCHEMA: Schema = threads_schema(
+    "The number of worker threads to process groups in parallel.",
+    1,
+);
+
 #[api(
     properties: {
         "next-run": {
@@ -664,6 +669,10 @@ pub const UNMOUNT_ON_SYNC_DONE_SCHEMA: Schema =
             type: SyncDirection,
             optional: true,
         },
+        "worker-threads": {
+            schema: SYNC_WORKER_THREADS_SCHEMA,
+            optional: true,
+        },
     }
 )]
 #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
@@ -709,6 +718,8 @@ pub struct SyncJobConfig {
     pub unmount_on_done: Option<bool>,
     #[serde(skip_serializing_if = "Option::is_none")]
     pub sync_direction: Option<SyncDirection>,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub worker_threads: Option<usize>,
 }
 
 impl SyncJobConfig {
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 02/15] tools: group and sort module imports
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox v6 01/15] pbs api types: add `worker-threads` to sync job config Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages Christian Ebner
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Makes it easier to find and insert new modules with some logical
consistency.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- not present in previous version

 pbs-tools/src/lib.rs | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
index af900c925..f41aef6df 100644
--- a/pbs-tools/src/lib.rs
+++ b/pbs-tools/src/lib.rs
@@ -1,3 +1,4 @@
+pub mod async_lru_cache;
 pub mod cert;
 pub mod crypt_config;
 pub mod format;
@@ -6,8 +7,6 @@ pub mod lru_cache;
 pub mod nom;
 pub mod sha;
 
-pub mod async_lru_cache;
-
 /// Set MMAP_THRESHOLD to a fixed value (128 KiB)
 ///
 /// This avoids the "dynamic" mmap-threshold logic from glibc's malloc, which seems misguided and
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox v6 01/15] pbs api types: add `worker-threads` to sync job config Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 02/15] tools: group and sort module imports Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-20 10:57   ` Fabian Grünbichler
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit Christian Ebner
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Implements a buffered logger instance which collects messages sent
from different sender instances via an async tokio channel and
buffers them. Senders are identified by a label and provide a log
level for each log line to be buffered and flushed.

On collection, log lines are grouped by label and buffered in order
of arrival per label, up to the configured maximum number of lines
per group, or flushed periodically at the configured interval. The
interval timeout is reset whenever contents are flushed. In addition,
senders can request a flush at any point.

When the interval timeout is reached, the log buffers of all labels
are flushed. There is no guarantee on the order of labels when
flushing.

Log output is written at the provided log level and prefixed by the
label.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- not present in previous version

 pbs-tools/Cargo.toml             |   2 +
 pbs-tools/src/buffered_logger.rs | 216 +++++++++++++++++++++++++++++++
 pbs-tools/src/lib.rs             |   1 +
 3 files changed, 219 insertions(+)
 create mode 100644 pbs-tools/src/buffered_logger.rs

diff --git a/pbs-tools/Cargo.toml b/pbs-tools/Cargo.toml
index 998e3077e..6b1d92fa6 100644
--- a/pbs-tools/Cargo.toml
+++ b/pbs-tools/Cargo.toml
@@ -17,10 +17,12 @@ openssl.workspace = true
 serde_json.workspace = true
 # rt-multi-thread is required for block_in_place
 tokio = { workspace = true, features = [ "fs", "io-util", "rt", "rt-multi-thread", "sync" ] }
+tracing.workspace = true
 
 proxmox-async.workspace = true
 proxmox-io = { workspace = true, features = [ "tokio" ] }
 proxmox-human-byte.workspace = true
+proxmox-log.workspace = true
 proxmox-sys.workspace = true
 proxmox-time.workspace = true
 
diff --git a/pbs-tools/src/buffered_logger.rs b/pbs-tools/src/buffered_logger.rs
new file mode 100644
index 000000000..39cf068cd
--- /dev/null
+++ b/pbs-tools/src/buffered_logger.rs
@@ -0,0 +1,216 @@
+//! Log aggregator to collect and group messages sent from concurrent tasks via
+//! a tokio channel.
+
+use std::collections::hash_map::Entry;
+use std::collections::HashMap;
+use std::time::Duration;
+
+use anyhow::Error;
+use tokio::sync::mpsc;
+use tokio::time::{self, Instant};
+use tracing::{debug, error, info, trace, warn, Level};
+
+use proxmox_log::LogContext;
+
+/// Label to be used to group currently buffered messages when flushing.
+pub type SenderLabel = String;
+
+/// Requested action for the log collection task
+enum SenderRequest {
+    // new log line to be buffered
+    Message(LogLine),
+    // flush currently buffered log lines associated by sender label
+    Flush(SenderLabel),
+}
+
+/// Logger instance to buffer and group log output to keep concurrent logs readable
+///
+/// Receives the logs from an async input channel, buffers them grouped by input
+/// channel and flushes them after either reaching a timeout or capacity limit.
+pub struct BufferedLogger {
+    // buffer to aggregate log lines based on sender label
+    buffer_map: HashMap<SenderLabel, Vec<LogLine>>,
+    // maximum number of received lines for an individual sender instance before
+    // flushing
+    max_buffered_lines: usize,
+    // maximum aggregation duration of received lines for an individual sender
+    // instance before flushing
+    max_aggregation_time: Duration,
+    // channel to receive log messages
+    receiver: mpsc::Receiver<SenderRequest>,
+}
+
+/// Instance to create new sender instances by cloning the channel sender
+pub struct LogLineSenderBuilder {
+    // to clone new senders if requested
+    _sender: mpsc::Sender<SenderRequest>,
+}
+
+impl LogLineSenderBuilder {
+    /// Create new sender instance to send log messages, to be grouped by given label
+    ///
+    /// Label is not checked to be unique (no other instance with same label exists),
+    /// it is the caller's responsibility to check this if required.
+    pub fn sender_with_label(&self, label: SenderLabel) -> LogLineSender {
+        LogLineSender {
+            label,
+            sender: self._sender.clone(),
+        }
+    }
+}
+
+/// Sender to publish new log messages to buffered log aggregator
+pub struct LogLineSender {
+    // label used to group log lines
+    label: SenderLabel,
+    // sender to publish new log lines to buffered log aggregator task
+    sender: mpsc::Sender<SenderRequest>,
+}
+
+impl LogLineSender {
+    /// Send a new log message with given level to the buffered logger task
+    pub async fn log(&self, level: Level, message: String) -> Result<(), Error> {
+        let line = LogLine {
+            label: self.label.clone(),
+            level,
+            message,
+        };
+        self.sender.send(SenderRequest::Message(line)).await?;
+        Ok(())
+    }
+
+    /// Flush all messages with sender's label
+    pub async fn flush(&self) -> Result<(), Error> {
+        self.sender
+            .send(SenderRequest::Flush(self.label.clone()))
+            .await?;
+        Ok(())
+    }
+}
+
+/// Log message entity
+struct LogLine {
+    /// label identifying the sender
+    label: SenderLabel,
+    /// Log level to use during flushing
+    level: Level,
+    /// log line to be buffered and flushed
+    message: String,
+}
+
+impl BufferedLogger {
+    /// New instance of a buffered logger
+    pub fn new(
+        max_buffered_lines: usize,
+        max_aggregation_time: Duration,
+    ) -> (Self, LogLineSenderBuilder) {
+        let (_sender, receiver) = mpsc::channel(100);
+
+        (
+            Self {
+                buffer_map: HashMap::new(),
+                max_buffered_lines,
+                max_aggregation_time,
+                receiver,
+            },
+            LogLineSenderBuilder { _sender },
+        )
+    }
+
+    /// Starts the collection loop spawned on a new tokio task
+    /// Finishes when all senders belonging to the channel have been dropped.
+    pub fn run_log_collection(mut self) {
+        let future = async move {
+            loop {
+                let deadline = Instant::now() + self.max_aggregation_time;
+                match time::timeout_at(deadline, self.receive_log_line()).await {
+                    Ok(finished) => {
+                        if finished {
+                            break;
+                        }
+                    }
+                    Err(_timeout) => self.flush_all_buffered(),
+                }
+            }
+        };
+        match LogContext::current() {
+            None => tokio::spawn(future),
+            Some(context) => tokio::spawn(context.scope(future)),
+        };
+    }
+
+    /// Collects new log lines, buffers and flushes them if max lines limit exceeded.
+    ///
+    /// Returns `true` if all the senders have been dropped and the task should no
+    /// longer wait for new messages and finish.
+    async fn receive_log_line(&mut self) -> bool {
+        if let Some(request) = self.receiver.recv().await {
+            match request {
+                SenderRequest::Flush(label) => {
+                    if let Some(log_lines) = self.buffer_map.get_mut(&label) {
+                        Self::log_with_label(&label, log_lines);
+                        log_lines.clear();
+                    }
+                }
+                SenderRequest::Message(log_line) => {
+                    // shortcut: no buffering configured, log directly and
+                    // skip the buffer entirely
+                    if self.max_buffered_lines == 0 || self.max_aggregation_time.is_zero() {
+                        Self::log_by_level(&log_line.label, &log_line);
+                        return false;
+                    }
+
+                    match self.buffer_map.entry(log_line.label.clone()) {
+                        Entry::Occupied(mut occupied) => {
+                            let log_lines = occupied.get_mut();
+                            if log_lines.len() + 1 > self.max_buffered_lines {
+                                // reached limit for this label,
+                                // flush all buffered and new log line
+                                Self::log_with_label(&log_line.label, log_lines);
+                                log_lines.clear();
+                                Self::log_by_level(&log_line.label, &log_line);
+                            } else {
+                                // below limit, push to buffer to flush later
+                                log_lines.push(log_line);
+                            }
+                        }
+                        Entry::Vacant(vacant) => {
+                            vacant.insert(vec![log_line]);
+                        }
+                    }
+                }
+            }
+            return false;
+        }
+
+        // no more senders, all LogLineSender's and LogLineSenderBuilder have been dropped
+        self.flush_all_buffered();
+        true
+    }
+
+    /// Flush all currently buffered contents without ordering, but grouped by label
+    fn flush_all_buffered(&mut self) {
+        for (label, log_lines) in self.buffer_map.iter() {
+            Self::log_with_label(label, log_lines);
+        }
+        self.buffer_map.clear();
+    }
+
+    /// Log given log lines prefixed by label
+    fn log_with_label(label: &str, log_lines: &[LogLine]) {
+        for log_line in log_lines {
+            Self::log_by_level(label, log_line);
+        }
+    }
+
+    /// Write the given log line prefixed by label
+    fn log_by_level(label: &str, log_line: &LogLine) {
+        match log_line.level {
+            Level::ERROR => error!("[{label}]: {}", log_line.message),
+            Level::WARN => warn!("[{label}]: {}", log_line.message),
+            Level::INFO => info!("[{label}]: {}", log_line.message),
+            Level::DEBUG => debug!("[{label}]: {}", log_line.message),
+            Level::TRACE => trace!("[{label}]: {}", log_line.message),
+        }
+    }
+}
diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
index f41aef6df..1e3972c92 100644
--- a/pbs-tools/src/lib.rs
+++ b/pbs-tools/src/lib.rs
@@ -1,4 +1,5 @@
 pub mod async_lru_cache;
+pub mod buffered_logger;
 pub mod cert;
 pub mod crypt_config;
 pub mod format;
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (2 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-20 11:15   ` Fabian Grünbichler
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync Christian Ebner
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

The BoundedJoinSet allows running tasks concurrently via a JoinSet,
but constrains the number of tasks run at once to an upper limit.

In contrast to the ParallelHandler, which is a purely synchronous
implementation and does not provide easy handling of returned
results, this allows executing tasks in an async context with
straightforward result handling, as required e.g. for pulling/pushing
backup groups in parallel for sync jobs. Also, the log context is
easily preserved, which is important for task logging.
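The bounded-spawn idea can be sketched with a synchronous, std-only
analogue (illustrative only, not the patch's API: it uses threads
instead of tokio tasks and joins the oldest worker rather than
whichever finishes next):

```rust
use std::collections::VecDeque;
use std::thread::{self, JoinHandle};

// Sketch of the bounded-spawn idea: at most `max_tasks` workers run at
// once; spawning past the limit first joins a running worker to free a
// slot and hands its result back to the caller.
struct BoundedWorkers<T> {
    max_tasks: usize,
    workers: VecDeque<JoinHandle<T>>,
}

impl<T: Send + 'static> BoundedWorkers<T> {
    fn new(max_tasks: usize) -> Self {
        Self { max_tasks, workers: VecDeque::new() }
    }

    // Blocks until a slot is free, returning results joined on the way.
    fn spawn<F: FnOnce() -> T + Send + 'static>(&mut self, task: F) -> Vec<T> {
        let mut results = Vec::new();
        while self.workers.len() >= self.max_tasks {
            // join the oldest worker (the tokio JoinSet instead yields
            // whichever task finishes next)
            let handle = self.workers.pop_front().expect("non-empty");
            results.push(handle.join().expect("worker panicked"));
        }
        self.workers.push_back(thread::spawn(task));
        results
    }

    // Wait for all remaining workers to complete.
    fn join_all(&mut self) -> Vec<T> {
        self.workers
            .drain(..)
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    }
}

fn main() {
    let mut set = BoundedWorkers::new(2);
    let mut results = Vec::new();
    for i in 0..5u32 {
        results.extend(set.spawn(move || i * i));
    }
    results.extend(set.join_all());
    results.sort();
    assert_eq!(results, vec![0, 1, 4, 9, 16]);
}
```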

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- not present in previous version, refactored logic from previous
  GroupWorker implementation.

 pbs-tools/src/bounded_join_set.rs | 69 +++++++++++++++++++++++++++++++
 pbs-tools/src/lib.rs              |  1 +
 2 files changed, 70 insertions(+)
 create mode 100644 pbs-tools/src/bounded_join_set.rs

diff --git a/pbs-tools/src/bounded_join_set.rs b/pbs-tools/src/bounded_join_set.rs
new file mode 100644
index 000000000..01b27b2a6
--- /dev/null
+++ b/pbs-tools/src/bounded_join_set.rs
@@ -0,0 +1,69 @@
+//! JoinSet with an upper bound of concurrent tasks.
+//!
+//! Allows to run up to the configured number of tasks concurrently in an async
+//! context.
+
+use std::future::Future;
+
+use tokio::task::{JoinError, JoinSet};
+
+use proxmox_log::LogContext;
+
+/// Run up to preconfigured number of futures concurrently on tokio tasks.
+pub struct BoundedJoinSet<T> {
+    // upper bound for concurrent task execution
+    max_tasks: usize,
+    // handles to currently active tasks
+    workers: JoinSet<T>,
+}
+
+impl<T: Send + 'static> BoundedJoinSet<T> {
+    /// Create a new join set with up to `max_tasks` concurrently executed tasks.
+    pub fn new(max_tasks: usize) -> Self {
+        Self {
+            max_tasks,
+            workers: JoinSet::new(),
+        }
+    }
+
+    /// Spawn the given task on the workers, waiting until there is capacity to do so.
+    ///
+    /// If there is no capacity, this awaits until a slot frees up, returning the
+    /// results of the task(s) that completed in the meantime, in order of
+    /// completion, or a `JoinError` if joining failed.
+    pub async fn spawn_task<F>(&mut self, task: F) -> Result<Vec<T>, JoinError>
+    where
+        F: Future<Output = T>,
+        F: Send + 'static,
+    {
+        let mut results = Vec::with_capacity(self.workers.len());
+
+        while self.workers.len() >= self.max_tasks {
+            // capacity reached, wait for an active task to complete
+            if let Some(result) = self.workers.join_next().await {
+                results.push(result?);
+            }
+        }
+
+        match LogContext::current() {
+            Some(context) => self.workers.spawn(context.scope(task)),
+            None => self.workers.spawn(task),
+        };
+
+        Ok(results)
+    }
+
+    /// Wait on all active tasks to run to completion.
+    ///
+    /// Returns the results for each task in order of completion or a `JoinError`
+    /// if joining failed.
+    pub async fn join_active(&mut self) -> Result<Vec<T>, JoinError> {
+        let mut results = Vec::with_capacity(self.workers.len());
+
+        while let Some(result) = self.workers.join_next().await {
+            results.push(result?);
+        }
+
+        Ok(results)
+    }
+}
diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
index 1e3972c92..dc55366b6 100644
--- a/pbs-tools/src/lib.rs
+++ b/pbs-tools/src/lib.rs
@@ -1,4 +1,5 @@
 pub mod async_lru_cache;
+pub mod bounded_join_set;
 pub mod buffered_logger;
 pub mod cert;
 pub mod crypt_config;
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (3 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-20 12:29   ` Fabian Grünbichler
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 06/15] api: config/sync: add optional `worker-threads` property Christian Ebner
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Currently, the logical size of the uploaded chunks is used for the
size and upload rate calculation of sync jobs in push direction,
leading to inflated values for the transferred size and rate.

Use the compressed chunk size instead. To get the required
information, return the more verbose `UploadStats` from
`upload_index_chunk_info` calls and use its compressed size for the
transferred `bytes` of `SyncStats`. Since `UploadStats` is now part
of a public API, widen its visibility as well.

This is then used to display the upload size and to calculate the
rate for the push sync job.
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 pbs-client/src/backup_stats.rs  | 20 ++++++++++----------
 pbs-client/src/backup_writer.rs |  4 ++--
 src/server/push.rs              |  4 ++--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/pbs-client/src/backup_stats.rs b/pbs-client/src/backup_stats.rs
index f0563a001..edf7ef3c4 100644
--- a/pbs-client/src/backup_stats.rs
+++ b/pbs-client/src/backup_stats.rs
@@ -15,16 +15,16 @@ pub struct BackupStats {
 }
 
 /// Extended backup run statistics and archive checksum
-pub(crate) struct UploadStats {
-    pub(crate) chunk_count: usize,
-    pub(crate) chunk_reused: usize,
-    pub(crate) chunk_injected: usize,
-    pub(crate) size: usize,
-    pub(crate) size_reused: usize,
-    pub(crate) size_injected: usize,
-    pub(crate) size_compressed: usize,
-    pub(crate) duration: Duration,
-    pub(crate) csum: [u8; 32],
+pub struct UploadStats {
+    pub chunk_count: usize,
+    pub chunk_reused: usize,
+    pub chunk_injected: usize,
+    pub size: usize,
+    pub size_reused: usize,
+    pub size_injected: usize,
+    pub size_compressed: usize,
+    pub duration: Duration,
+    pub csum: [u8; 32],
 }
 
 impl UploadStats {
diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
index 49aff3fdd..4a4391c8b 100644
--- a/pbs-client/src/backup_writer.rs
+++ b/pbs-client/src/backup_writer.rs
@@ -309,7 +309,7 @@ impl BackupWriter {
         archive_name: &BackupArchiveName,
         stream: impl Stream<Item = Result<MergedChunkInfo, Error>>,
         options: UploadOptions,
-    ) -> Result<BackupStats, Error> {
+    ) -> Result<UploadStats, Error> {
         let mut param = json!({ "archive-name": archive_name });
         let (prefix, archive_size) = options.index_type.to_prefix_and_size();
         if let Some(size) = archive_size {
@@ -391,7 +391,7 @@ impl BackupWriter {
             .post(&format!("{prefix}_close"), Some(param))
             .await?;
 
-        Ok(upload_stats.to_backup_stats())
+        Ok(upload_stats)
     }
 
     pub async fn upload_stream(
diff --git a/src/server/push.rs b/src/server/push.rs
index 697b94f2f..494e0fbce 100644
--- a/src/server/push.rs
+++ b/src/server/push.rs
@@ -1059,8 +1059,8 @@ async fn push_index(
         .await?;
 
     Ok(SyncStats {
-        chunk_count: upload_stats.chunk_count as usize,
-        bytes: upload_stats.size as usize,
+        chunk_count: upload_stats.chunk_count,
+        bytes: upload_stats.size_compressed,
         elapsed: upload_stats.duration,
         removed: None,
     })
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 06/15] api: config/sync: add optional `worker-threads` property
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (4 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 07/15] sync: pull: revert avoiding reinstantiation for encountered chunks map Christian Ebner
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Allow configuring from 1 up to 32 worker threads to perform multiple
group syncs in parallel.

The property is exposed via the sync job config and passed to the
pull/push parameters for the sync job to set up and execute the
thread pool accordingly.

Implements the schema definitions and adds the new property to
`SyncJobConfig`, `PullParameters` and `PushParameters`.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 src/api2/config/sync.rs | 10 ++++++++++
 src/api2/pull.rs        |  9 ++++++++-
 src/api2/push.rs        |  8 +++++++-
 src/server/pull.rs      |  4 ++++
 src/server/push.rs      |  4 ++++
 src/server/sync.rs      |  1 +
 6 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
index 67fa3182c..0f073ca54 100644
--- a/src/api2/config/sync.rs
+++ b/src/api2/config/sync.rs
@@ -344,6 +344,8 @@ pub enum DeletableProperty {
     UnmountOnDone,
     /// Delete the sync_direction property,
     SyncDirection,
+    /// Delete the worker_threads property,
+    WorkerThreads,
 }
 
 #[api(
@@ -467,6 +469,9 @@ pub fn update_sync_job(
                 DeletableProperty::SyncDirection => {
                     data.sync_direction = None;
                 }
+                DeletableProperty::WorkerThreads => {
+                    data.worker_threads = None;
+                }
             }
         }
     }
@@ -526,6 +531,10 @@ pub fn update_sync_job(
         data.sync_direction = Some(sync_direction);
     }
 
+    if let Some(worker_threads) = update.worker_threads {
+        data.worker_threads = Some(worker_threads);
+    }
+
     if update.limit.rate_in.is_some() {
         data.limit.rate_in = update.limit.rate_in;
     }
@@ -698,6 +707,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
         run_on_mount: None,
         unmount_on_done: None,
         sync_direction: None, // use default
+        worker_threads: None,
     };
 
     // should work without ACLs
diff --git a/src/api2/pull.rs b/src/api2/pull.rs
index 4b1fd5e60..7cf165f91 100644
--- a/src/api2/pull.rs
+++ b/src/api2/pull.rs
@@ -11,7 +11,7 @@ use pbs_api_types::{
     GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
     PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
     RESYNC_CORRUPT_SCHEMA, SYNC_ENCRYPTED_ONLY_SCHEMA, SYNC_VERIFIED_ONLY_SCHEMA,
-    TRANSFER_LAST_SCHEMA,
+    SYNC_WORKER_THREADS_SCHEMA, TRANSFER_LAST_SCHEMA,
 };
 use pbs_config::CachedUserInfo;
 use proxmox_rest_server::WorkerTask;
@@ -91,6 +91,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
             sync_job.encrypted_only,
             sync_job.verified_only,
             sync_job.resync_corrupt,
+            sync_job.worker_threads,
         )
     }
 }
@@ -148,6 +149,10 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
                 schema: RESYNC_CORRUPT_SCHEMA,
                 optional: true,
             },
+            "worker-threads": {
+                schema: SYNC_WORKER_THREADS_SCHEMA,
+                optional: true,
+            },
         },
     },
     access: {
@@ -175,6 +180,7 @@ async fn pull(
     encrypted_only: Option<bool>,
     verified_only: Option<bool>,
     resync_corrupt: Option<bool>,
+    worker_threads: Option<usize>,
     rpcenv: &mut dyn RpcEnvironment,
 ) -> Result<String, Error> {
     let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
@@ -215,6 +221,7 @@ async fn pull(
         encrypted_only,
         verified_only,
         resync_corrupt,
+        worker_threads,
     )?;
 
     // fixme: set to_stdout to false?
diff --git a/src/api2/push.rs b/src/api2/push.rs
index e5edc13e0..f27f4ea1a 100644
--- a/src/api2/push.rs
+++ b/src/api2/push.rs
@@ -6,7 +6,7 @@ use pbs_api_types::{
     GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
     PRIV_DATASTORE_READ, PRIV_REMOTE_DATASTORE_BACKUP, PRIV_REMOTE_DATASTORE_PRUNE,
     REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, SYNC_ENCRYPTED_ONLY_SCHEMA,
-    SYNC_VERIFIED_ONLY_SCHEMA, TRANSFER_LAST_SCHEMA,
+    SYNC_VERIFIED_ONLY_SCHEMA, SYNC_WORKER_THREADS_SCHEMA, TRANSFER_LAST_SCHEMA,
 };
 use proxmox_rest_server::WorkerTask;
 use proxmox_router::{Permission, Router, RpcEnvironment};
@@ -108,6 +108,10 @@ fn check_push_privs(
                 schema: TRANSFER_LAST_SCHEMA,
                 optional: true,
             },
+            "worker-threads": {
+                schema: SYNC_WORKER_THREADS_SCHEMA,
+                optional: true,
+            },
         },
     },
     access: {
@@ -133,6 +137,7 @@ async fn push(
     verified_only: Option<bool>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    worker_threads: Option<usize>,
     rpcenv: &mut dyn RpcEnvironment,
 ) -> Result<String, Error> {
     let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
@@ -164,6 +169,7 @@ async fn push(
         verified_only,
         limit,
         transfer_last,
+        worker_threads,
     )
     .await?;
 
diff --git a/src/server/pull.rs b/src/server/pull.rs
index bd3e8bef4..ca17eb243 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -65,6 +65,8 @@ pub(crate) struct PullParameters {
     verified_only: bool,
     /// Whether to re-sync corrupted snapshots
     resync_corrupt: bool,
+    /// Maximum number of worker threads for pulling during a sync job
+    worker_threads: Option<usize>,
 }
 
 impl PullParameters {
@@ -85,6 +87,7 @@ impl PullParameters {
         encrypted_only: Option<bool>,
         verified_only: Option<bool>,
         resync_corrupt: Option<bool>,
+        worker_threads: Option<usize>,
     ) -> Result<Self, Error> {
         if let Some(max_depth) = max_depth {
             ns.check_max_depth(max_depth)?;
@@ -137,6 +140,7 @@ impl PullParameters {
             encrypted_only,
             verified_only,
             resync_corrupt,
+            worker_threads,
         })
     }
 }
diff --git a/src/server/push.rs b/src/server/push.rs
index 494e0fbce..44a204e6b 100644
--- a/src/server/push.rs
+++ b/src/server/push.rs
@@ -83,6 +83,8 @@ pub(crate) struct PushParameters {
     verified_only: bool,
     /// How many snapshots should be transferred at most (taking the newest N snapshots)
     transfer_last: Option<usize>,
+    /// Maximum number of worker threads for push during sync job
+    worker_threads: Option<usize>,
 }
 
 impl PushParameters {
@@ -102,6 +104,7 @@ impl PushParameters {
         verified_only: Option<bool>,
         limit: RateLimitConfig,
         transfer_last: Option<usize>,
+        worker_threads: Option<usize>,
     ) -> Result<Self, Error> {
         if let Some(max_depth) = max_depth {
             ns.check_max_depth(max_depth)?;
@@ -165,6 +168,7 @@ impl PushParameters {
             encrypted_only,
             verified_only,
             transfer_last,
+            worker_threads,
         })
     }
 
diff --git a/src/server/sync.rs b/src/server/sync.rs
index aedf4a271..9e6aeb9b0 100644
--- a/src/server/sync.rs
+++ b/src/server/sync.rs
@@ -675,6 +675,7 @@ pub fn do_sync_job(
                             sync_job.verified_only,
                             sync_job.limit.clone(),
                             sync_job.transfer_last,
+                            sync_job.worker_threads,
                         )
                         .await?;
                         push_store(push_params).await?
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 07/15] sync: pull: revert avoiding reinstantiation for encountered chunks map
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (5 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 06/15] api: config/sync: add optional `worker-threads` property Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 08/15] sync: pull: factor out backup group locking and owner check Christian Ebner
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

While keeping a store-wide instance to avoid reinstantiation for each
group is desirable when processing groups iteratively, this cannot
work when syncing multiple groups in parallel.

This is in preparation for parallel group syncs and reverts commit
ecdec5bc ("sync: pull: avoid reinstantiation for encountered chunks
map").
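Conceptually, each group pull now owns its own encountered-chunks set
instead of sharing and clearing a store-wide one. A minimal standalone
sketch of the per-group pattern (the function name and digest values
are hypothetical, not taken from the patch):

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Hypothetical sketch: each group pull instantiates its own dedup set,
// so parallel group pulls cannot race on a shared, cleared-per-group map.
fn pull_group_sketch(group: &str) -> usize {
    // per-group set of already encountered chunk digests
    let encountered: Arc<Mutex<HashSet<[u8; 32]>>> =
        Arc::new(Mutex::new(HashSet::with_capacity(1024 * 64)));
    let mut fetched = 0;
    for digest in [[0u8; 32], [1u8; 32], [0u8; 32]] {
        // insert() returns false for duplicates -> chunk already fetched
        if encountered.lock().unwrap().insert(digest) {
            fetched += 1; // a real implementation would download the chunk here
        }
    }
    println!("{group}: fetched {fetched} of 3 chunks");
    fetched
}

fn main() {
    // two of the three digests are distinct, so only two chunks are fetched
    assert_eq!(pull_group_sketch("vm/100"), 2);
}
```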

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 src/server/pull.rs | 26 +++++---------------------
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/src/server/pull.rs b/src/server/pull.rs
index ca17eb243..45fe9f8b1 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -620,7 +620,6 @@ async fn pull_group(
     source_namespace: &BackupNamespace,
     group: &BackupGroup,
     progress: &mut StoreProgress,
-    encountered_chunks: Arc<Mutex<EncounteredChunks>>,
 ) -> Result<SyncStats, Error> {
     let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
     let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
@@ -721,6 +720,9 @@ async fn pull_group(
         transfer_last_skip_info.reset();
     }
 
+    // start with 65536 chunks (up to 256 GiB)
+    let encountered_chunks = Arc::new(Mutex::new(EncounteredChunks::with_capacity(1024 * 64)));
+
     let backup_group = params
         .target
         .store
@@ -984,9 +986,6 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
     let mut synced_ns = HashSet::with_capacity(namespaces.len());
     let mut sync_stats = SyncStats::default();
 
-    // start with 65536 chunks (up to 256 GiB)
-    let encountered_chunks = Arc::new(Mutex::new(EncounteredChunks::with_capacity(1024 * 64)));
-
     for namespace in namespaces {
         let source_store_ns_str = print_store_and_ns(params.source.get_store(), &namespace);
 
@@ -1008,7 +1007,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
             }
         }
 
-        match pull_ns(&namespace, &mut params, encountered_chunks.clone()).await {
+        match pull_ns(&namespace, &mut params).await {
             Ok((ns_progress, ns_sync_stats, ns_errors)) => {
                 errors |= ns_errors;
 
@@ -1066,7 +1065,6 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
 async fn pull_ns(
     namespace: &BackupNamespace,
     params: &mut PullParameters,
-    encountered_chunks: Arc<Mutex<EncounteredChunks>>,
 ) -> Result<(StoreProgress, SyncStats, bool), Error> {
     let list: Vec<BackupGroup> = params.source.list_groups(namespace, &params.owner).await?;
 
@@ -1125,16 +1123,7 @@ async fn pull_ns(
             );
             errors = true; // do not stop here, instead continue
         } else {
-            encountered_chunks.lock().unwrap().clear();
-            match pull_group(
-                params,
-                namespace,
-                &group,
-                &mut progress,
-                encountered_chunks.clone(),
-            )
-            .await
-            {
+            match pull_group(params, namespace, &group, &mut progress).await {
                 Ok(stats) => sync_stats.add(stats),
                 Err(err) => {
                     info!("sync group {} failed - {err:#}", &group);
@@ -1255,9 +1244,4 @@ impl EncounteredChunks {
             }
         }
     }
-
-    /// Clear all entries
-    fn clear(&mut self) {
-        self.chunk_set.clear();
-    }
 }
-- 
2.47.3






* [PATCH proxmox-backup v6 08/15] sync: pull: factor out backup group locking and owner check
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (6 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 07/15] sync: pull: revert avoiding reinstantiation for encountered chunks map Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 09/15] sync: pull: prepare pull parameters to be shared across parallel tasks Christian Ebner
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Create a dedicated entry point for parallel group pulling and
simplify the backup group loop logic.

While the locking and owner check could have been moved into
pull_group() as well, that function is already hard to parse as is.
Error logging is moved into the helper to prepare it for parallel
pulling.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- adapted to better fit subsequent introduction of log line sender

 src/server/pull.rs | 76 +++++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/src/server/pull.rs b/src/server/pull.rs
index 45fe9f8b1..7126a5102 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -1050,6 +1050,47 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
     Ok(sync_stats)
 }
 
+/// Get an exclusive lock on the backup group, check ownership matches
+/// sync job owner and pull group contents.
+async fn lock_and_pull_group(
+    params: &PullParameters,
+    group: &BackupGroup,
+    namespace: &BackupNamespace,
+    target_namespace: &BackupNamespace,
+    progress: &mut StoreProgress,
+) -> Result<SyncStats, Error> {
+    let (owner, _lock_guard) =
+        match params
+            .target
+            .store
+            .create_locked_backup_group(target_namespace, group, &params.owner)
+        {
+            Ok(res) => res,
+            Err(err) => {
+                info!("sync group {group} failed - group lock failed: {err}");
+                info!("create_locked_backup_group failed");
+                return Err(err);
+            }
+        };
+
+    if params.owner != owner {
+        // only the owner is allowed to create additional snapshots
+        info!(
+            "sync group {group} failed - owner check failed ({} != {owner})",
+            params.owner
+        );
+        return Err(format_err!("owner check failed"));
+    }
+
+    match pull_group(params, namespace, group, progress).await {
+        Ok(stats) => Ok(stats),
+        Err(err) => {
+            info!("sync group {group} failed - {err:#}");
+            Err(err)
+        }
+    }
+}
+
 /// Pulls a namespace according to `params`.
 ///
 /// Pulling a namespace consists of the following steps:
@@ -1098,38 +1139,9 @@ async fn pull_ns(
         progress.done_snapshots = 0;
         progress.group_snapshots = 0;
 
-        let (owner, _lock_guard) =
-            match params
-                .target
-                .store
-                .create_locked_backup_group(&target_ns, &group, &params.owner)
-            {
-                Ok(result) => result,
-                Err(err) => {
-                    info!("sync group {} failed - group lock failed: {err}", &group);
-                    errors = true;
-                    // do not stop here, instead continue
-                    info!("create_locked_backup_group failed");
-                    continue;
-                }
-            };
-
-        // permission check
-        if params.owner != owner {
-            // only the owner is allowed to create additional snapshots
-            info!(
-                "sync group {} failed - owner check failed ({} != {owner})",
-                &group, params.owner
-            );
-            errors = true; // do not stop here, instead continue
-        } else {
-            match pull_group(params, namespace, &group, &mut progress).await {
-                Ok(stats) => sync_stats.add(stats),
-                Err(err) => {
-                    info!("sync group {} failed - {err:#}", &group);
-                    errors = true; // do not stop here, instead continue
-                }
-            }
+        match lock_and_pull_group(params, &group, &namespace, &target_ns, &mut progress).await {
+            Ok(stats) => sync_stats.add(stats),
+            Err(_err) => errors = true,
         }
     }
 
-- 
2.47.3






* [PATCH proxmox-backup v6 09/15] sync: pull: prepare pull parameters to be shared across parallel tasks
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (7 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 08/15] sync: pull: factor out backup group locking and owner check Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 10/15] fix #4182: server: sync: allow pulling backup groups in parallel Christian Ebner
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

When performing parallel group syncs, the pull parameters must be
shared between all tasks, which is not possible with regular
references due to lifetime and ownership constraints. Wrap them in an
atomically reference-counted pointer (`Arc`) instead, so they can
cheaply be cloned where required.
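The ownership issue and its Arc-based solution can be sketched
standalone with std threads (the series spawns tokio tasks instead,
and PullParameters is reduced here to a stub with a hypothetical
`store` field):

```rust
use std::sync::Arc;
use std::thread;

// Stub standing in for the real PullParameters, which is not Clone and
// holds source/target handles that every worker must be able to access.
struct PullParameters {
    store: String,
}

// Spawn `n` workers that each hold their own Arc clone of the parameters.
fn run_workers(params: Arc<PullParameters>, n: usize) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let params = Arc::clone(&params); // cheap refcount bump, no copy
            thread::spawn(move || format!("worker {i}: pulling from {}", params.store))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let params = Arc::new(PullParameters {
        store: "datastore".to_string(),
    });
    for line in run_workers(params, 4) {
        println!("{line}");
    }
}
```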

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 src/server/pull.rs | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/src/server/pull.rs b/src/server/pull.rs
index 7126a5102..5beca6b8d 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -380,7 +380,7 @@ async fn pull_single_archive<'a>(
 ///   -- if not, pull it from the remote
 /// - Download log if not already existing
 async fn pull_snapshot<'a>(
-    params: &PullParameters,
+    params: Arc<PullParameters>,
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
@@ -558,7 +558,7 @@ async fn pull_snapshot<'a>(
 /// The `reader` is configured to read from the source backup directory, while the
 /// `snapshot` is pointing to the local datastore and target namespace.
 async fn pull_snapshot_from<'a>(
-    params: &PullParameters,
+    params: Arc<PullParameters>,
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
@@ -616,7 +616,7 @@ async fn pull_snapshot_from<'a>(
 /// - remote snapshot access is checked by remote (twice: query and opening the backup reader)
 /// - local group owner is already checked by pull_store
 async fn pull_group(
-    params: &PullParameters,
+    params: Arc<PullParameters>,
     source_namespace: &BackupNamespace,
     group: &BackupGroup,
     progress: &mut StoreProgress,
@@ -797,7 +797,7 @@ async fn pull_group(
             .reader(source_namespace, &from_snapshot)
             .await?;
         let result = pull_snapshot_from(
-            params,
+            Arc::clone(&params),
             reader,
             &to_snapshot,
             encountered_chunks.clone(),
@@ -985,6 +985,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
     let (mut groups, mut snapshots) = (0, 0);
     let mut synced_ns = HashSet::with_capacity(namespaces.len());
     let mut sync_stats = SyncStats::default();
+    let params = Arc::new(params);
 
     for namespace in namespaces {
         let source_store_ns_str = print_store_and_ns(params.source.get_store(), &namespace);
@@ -1007,7 +1008,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
             }
         }
 
-        match pull_ns(&namespace, &mut params).await {
+        match pull_ns(&namespace, Arc::clone(&params)).await {
             Ok((ns_progress, ns_sync_stats, ns_errors)) => {
                 errors |= ns_errors;
 
@@ -1053,7 +1054,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
 /// Get an exclusive lock on the backup group, check ownership matches
 /// sync job owner and pull group contents.
 async fn lock_and_pull_group(
-    params: &PullParameters,
+    params: Arc<PullParameters>,
     group: &BackupGroup,
     namespace: &BackupNamespace,
     target_namespace: &BackupNamespace,
@@ -1105,7 +1106,7 @@ async fn lock_and_pull_group(
 /// - owner check for vanished groups done here
 async fn pull_ns(
     namespace: &BackupNamespace,
-    params: &mut PullParameters,
+    params: Arc<PullParameters>,
 ) -> Result<(StoreProgress, SyncStats, bool), Error> {
     let list: Vec<BackupGroup> = params.source.list_groups(namespace, &params.owner).await?;
 
@@ -1139,7 +1140,15 @@ async fn pull_ns(
         progress.done_snapshots = 0;
         progress.group_snapshots = 0;
 
-        match lock_and_pull_group(params, &group, &namespace, &target_ns, &mut progress).await {
+        match lock_and_pull_group(
+            Arc::clone(&params),
+            &group,
+            namespace,
+            &target_ns,
+            &mut progress,
+        )
+        .await
+        {
             Ok(stats) => sync_stats.add(stats),
             Err(_err) => errors = true,
         }
-- 
2.47.3






* [PATCH proxmox-backup v6 10/15] fix #4182: server: sync: allow pulling backup groups in parallel
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (8 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 09/15] sync: pull: prepare pull parameters to be shared across parallel tasks Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context Christian Ebner
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Currently, a sync job pulls the backup groups and the snapshots
contained within them sequentially. For remote syncs, the download
speed is therefore limited by the single HTTP/2 connection of the
source reader instance, which suffers from head-of-line blocking on
high-latency networks.

Improve the throughput by pulling up to a configured number of backup
groups in parallel, using a bounded join set which spawns up to the
configured number of tokio tasks to concurrently pull from the remote
source. Since these are dedicated tasks, they run independently and
in parallel on the tokio runtime.

Store progress output is now prefixed with the group, as the snapshot
count differs per group being pulled. To update the output on a
per-group level, the shared group progress count is passed as an
atomic counter, and the store progress is accounted both globally and
per-group.
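The bounding effect can be illustrated with a plain counting
semaphore over std threads; this is only a sketch of the concept,
the series itself uses the tokio-based BoundedJoinSet from
pbs-tools, whose API is not reproduced here:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore used to cap concurrent workers.
struct Semaphore {
    count: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { count: Mutex::new(permits), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.cv.wait(count).unwrap();
        }
        *count -= 1;
    }
    fn release(&self) {
        *self.count.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

// Run `groups` dummy group pulls with at most `limit` active at once;
// returns the maximum concurrency actually observed.
fn pull_groups_bounded(limit: usize, groups: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let stats = Arc::new(Mutex::new((0usize, 0usize))); // (active, max active)
    let handles: Vec<_> = (0..groups)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut s = stats.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                thread::sleep(Duration::from_millis(20)); // simulated group pull
                stats.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let max = stats.lock().unwrap().1;
    max
}

fn main() {
    let max = pull_groups_bounded(2, 6); // worker-threads = 2
    println!("max concurrent group pulls: {max}");
    assert!(max <= 2);
}
```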

Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=4182
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- uses BoundedJoinSet implementation, refactored accordingly

 src/server/pull.rs | 70 ++++++++++++++++++++++++++++++++--------------
 src/server/sync.rs | 33 ++++++++++++++++++++++
 2 files changed, 82 insertions(+), 21 deletions(-)

diff --git a/src/server/pull.rs b/src/server/pull.rs
index 5beca6b8d..611441d2a 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -26,6 +26,7 @@ use pbs_datastore::index::IndexFile;
 use pbs_datastore::manifest::{BackupManifest, FileInfo};
 use pbs_datastore::read_chunk::AsyncReadChunk;
 use pbs_datastore::{check_backup_owner, DataStore, DatastoreBackend, StoreProgress};
+use pbs_tools::bounded_join_set::BoundedJoinSet;
 use pbs_tools::sha::sha256;
 
 use super::sync::{
@@ -34,6 +35,7 @@ use super::sync::{
     SkipReason, SyncSource, SyncSourceReader, SyncStats,
 };
 use crate::backup::{check_ns_modification_privs, check_ns_privs};
+use crate::server::sync::SharedGroupProgress;
 use crate::tools::parallel_handler::ParallelHandler;
 
 pub(crate) struct PullTarget {
@@ -619,7 +621,7 @@ async fn pull_group(
     params: Arc<PullParameters>,
     source_namespace: &BackupNamespace,
     group: &BackupGroup,
-    progress: &mut StoreProgress,
+    shared_group_progress: Arc<SharedGroupProgress>,
 ) -> Result<SyncStats, Error> {
     let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
     let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
@@ -782,7 +784,8 @@ async fn pull_group(
         }
     }
 
-    progress.group_snapshots = list.len() as u64;
+    let mut local_progress = StoreProgress::new(shared_group_progress.total_groups());
+    local_progress.group_snapshots = list.len() as u64;
 
     let mut sync_stats = SyncStats::default();
 
@@ -805,8 +808,10 @@ async fn pull_group(
         )
         .await;
 
-        progress.done_snapshots = pos as u64 + 1;
-        info!("percentage done: {progress}");
+        // Update done groups progress by other parallel running pulls
+        local_progress.done_groups = shared_group_progress.load_done();
+        local_progress.done_snapshots = pos as u64 + 1;
+        info!("percentage done: group {group}: {local_progress}");
 
         let stats = result?; // stop on error
         sync_stats.add(stats);
@@ -1058,7 +1063,7 @@ async fn lock_and_pull_group(
     group: &BackupGroup,
     namespace: &BackupNamespace,
     target_namespace: &BackupNamespace,
-    progress: &mut StoreProgress,
+    shared_group_progress: Arc<SharedGroupProgress>,
 ) -> Result<SyncStats, Error> {
     let (owner, _lock_guard) =
         match params
@@ -1083,7 +1088,7 @@ async fn lock_and_pull_group(
         return Err(format_err!("owner check failed"));
     }
 
-    match pull_group(params, namespace, group, progress).await {
+    match pull_group(params, namespace, group, shared_group_progress).await {
         Ok(stats) => Ok(stats),
         Err(err) => {
             info!("sync group {group} failed - {err:#}");
@@ -1135,25 +1140,48 @@ async fn pull_ns(
 
     let target_ns = namespace.map_prefix(&params.source.get_ns(), &params.target.ns)?;
 
-    for (done, group) in list.into_iter().enumerate() {
-        progress.done_groups = done as u64;
-        progress.done_snapshots = 0;
-        progress.group_snapshots = 0;
+    let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
+    let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
 
-        match lock_and_pull_group(
-            Arc::clone(&params),
-            &group,
-            namespace,
-            &target_ns,
-            &mut progress,
-        )
-        .await
-        {
-            Ok(stats) => sync_stats.add(stats),
-            Err(_err) => errors = true,
+    let mut process_results = |results| {
+        for result in results {
+            match result {
+                Ok(stats) => {
+                    sync_stats.add(stats);
+                    progress.done_groups = shared_group_progress.increment_done();
+                }
+                Err(_err) => errors = true,
+            }
         }
+    };
+
+    for group in list.into_iter() {
+        let namespace = namespace.clone();
+        let target_ns = target_ns.clone();
+        let params = Arc::clone(&params);
+        let group_progress_cloned = Arc::clone(&shared_group_progress);
+        let results = group_workers
+            .spawn_task(async move {
+                lock_and_pull_group(
+                    Arc::clone(&params),
+                    &group,
+                    &namespace,
+                    &target_ns,
+                    group_progress_cloned,
+                )
+                .await
+            })
+            .await
+            .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
+        process_results(results);
     }
 
+    let results = group_workers
+        .join_active()
+        .await
+        .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
+    process_results(results);
+
     if params.remove_vanished {
         let result: Result<(), Error> = proxmox_lang::try_block!({
             for local_group in params.target.store.iter_backup_groups(target_ns.clone())? {
diff --git a/src/server/sync.rs b/src/server/sync.rs
index 9e6aeb9b0..e88418442 100644
--- a/src/server/sync.rs
+++ b/src/server/sync.rs
@@ -4,6 +4,7 @@ use std::collections::HashMap;
 use std::io::{Seek, Write};
 use std::ops::Deref;
 use std::path::{Path, PathBuf};
+use std::sync::atomic::{AtomicUsize, Ordering};
 use std::sync::{Arc, Mutex};
 use std::time::Duration;
 
@@ -12,6 +13,7 @@ use futures::{future::FutureExt, select};
 use hyper::http::StatusCode;
 use pbs_config::BackupLockGuard;
 use serde_json::json;
+use tokio::task::JoinSet;
 use tracing::{info, warn};
 
 use proxmox_human_byte::HumanByte;
@@ -792,3 +794,34 @@ pub(super) fn exclude_not_verified_or_encrypted(
 
     false
 }
+
+/// Track group progress during parallel push/pull in sync jobs
+pub(crate) struct SharedGroupProgress {
+    done: AtomicUsize,
+    total: usize,
+}
+
+impl SharedGroupProgress {
+    /// Create a new instance to track group progress with expected total number of groups
+    pub(crate) fn with_total_groups(total: usize) -> Self {
+        Self {
+            done: AtomicUsize::new(0),
+            total,
+        }
+    }
+
+    /// Return current counter value for done groups
+    pub(crate) fn load_done(&self) -> u64 {
+        self.done.load(Ordering::Acquire) as u64
+    }
+
+    /// Increment counter for done groups and return new value
+    pub(crate) fn increment_done(&self) -> u64 {
+        self.done.fetch_add(1, Ordering::AcqRel) as u64 + 1
+    }
+
+    /// Return the number of total backup groups
+    pub(crate) fn total_groups(&self) -> u64 {
+        self.total as u64
+    }
+}
-- 
2.47.3






* [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (9 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 10/15] fix #4182: server: sync: allow pulling backup groups in parallel Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-20 11:56   ` Fabian Grünbichler
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 12/15] sync: push: prepare push parameters to be shared across parallel tasks Christian Ebner
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Pulling groups, and therefore also snapshots, in parallel leads to
unordered log output, making it mostly impossible to relate a log
message to a backup snapshot/group.

Therefore, prefix pull job log messages by the corresponding group or
snapshot and set the error context accordingly.

Also, reword some messages, inline variables in format strings and
start log lines with capital letters to get consistent output.

By using the buffered logger implementation to buffer up to 5 lines
with a timeout of 1 second, log lines arriving in fast succession
are kept together, reducing the interleaving of lines.
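The buffering behaviour (flush once 5 lines accumulated, or after a
1-second timeout) can be sketched with a std mpsc channel; this is
only an illustration, the actual BufferedLogger/LogLineSender API in
pbs-tools is not reproduced here:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Collect lines, flushing the buffer once 5 lines have accumulated
// or 1 second passed without new input; returns all flushed lines.
fn run_logger(count: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    let logger = thread::spawn(move || {
        let mut out = Vec::new();
        let mut buf: Vec<String> = Vec::new();
        loop {
            match rx.recv_timeout(Duration::from_secs(1)) {
                Ok(line) => {
                    buf.push(line);
                    if buf.len() >= 5 {
                        out.append(&mut buf); // buffer full: flush
                    }
                }
                Err(mpsc::RecvTimeoutError::Timeout) => out.append(&mut buf),
                Err(mpsc::RecvTimeoutError::Disconnected) => {
                    out.append(&mut buf); // final flush on shutdown
                    break;
                }
            }
        }
        out
    });
    for i in 0..count {
        tx.send(format!("[ct/100]: line {i}")).unwrap();
    }
    drop(tx); // closes the channel, terminating the logger thread
    logger.join().unwrap()
}

fn main() {
    for line in run_logger(7) {
        println!("{line}");
    }
}
```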

Example output for a sequential pull job:
```
...
[ct/100]: 2025-11-17T10:11:42Z: start sync
[ct/100]: 2025-11-17T10:11:42Z: pct.conf.blob: sync archive
[ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: sync archive
[ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: downloaded 16.785 MiB (280.791 MiB/s)
[ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: sync archive
[ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: downloaded 65.703 KiB (29.1 MiB/s)
[ct/100]: 2025-11-17T10:11:42Z: sync done
[ct/100]: percentage done: 9.09% (1/11 groups)
[ct/101]: 2026-03-31T12:20:16Z: start sync
[ct/101]: 2026-03-31T12:20:16Z: pct.conf.blob: sync archive
[ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: sync archive
[ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: downloaded 199.806 MiB (311.91 MiB/s)
[ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: sync archive
[ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: downloaded 180.379 KiB (22.748 MiB/s)
[ct/101]: 2026-03-31T12:20:16Z: sync done
...
```

Example output for a parallel pull job:
```
...
[ct/107]: 2025-07-16T09:14:01Z: start sync
[ct/107]: 2025-07-16T09:14:01Z: pct.conf.blob: sync archive
[ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: sync archive
[vm/108]: 2025-09-19T07:37:19Z: start sync
[vm/108]: 2025-09-19T07:37:19Z: qemu-server.conf.blob: sync archive
[vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: sync archive
[ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: downloaded 609.233 MiB (112.628 MiB/s)
[ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: sync archive
[ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: downloaded 1.172 MiB (17.838 MiB/s)
[ct/107]: 2025-07-16T09:14:01Z: sync done
[ct/107]: percentage done: 72.73% (8/11 groups)
[vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: downloaded 1.196 GiB (156.892 MiB/s)
[vm/108]: 2025-09-19T07:37:19Z: sync done
...
```

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- uses BufferedLogger implementation, refactored accordingly
- improve log line prefixes
- add missing error contexts

 src/server/pull.rs | 314 +++++++++++++++++++++++++++++++++------------
 src/server/sync.rs |   8 +-
 2 files changed, 237 insertions(+), 85 deletions(-)

diff --git a/src/server/pull.rs b/src/server/pull.rs
index 611441d2a..f7aae4d59 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -5,11 +5,11 @@ use std::collections::{HashMap, HashSet};
 use std::io::Seek;
 use std::sync::atomic::{AtomicUsize, Ordering};
 use std::sync::{Arc, Mutex};
-use std::time::SystemTime;
+use std::time::{Duration, SystemTime};
 
 use anyhow::{bail, format_err, Context, Error};
 use proxmox_human_byte::HumanByte;
-use tracing::{info, warn};
+use tracing::{info, Level};
 
 use pbs_api_types::{
     print_store_and_ns, ArchiveType, Authid, BackupArchiveName, BackupDir, BackupGroup,
@@ -27,6 +27,7 @@ use pbs_datastore::manifest::{BackupManifest, FileInfo};
 use pbs_datastore::read_chunk::AsyncReadChunk;
 use pbs_datastore::{check_backup_owner, DataStore, DatastoreBackend, StoreProgress};
 use pbs_tools::bounded_join_set::BoundedJoinSet;
+use pbs_tools::buffered_logger::{BufferedLogger, LogLineSender};
 use pbs_tools::sha::sha256;
 
 use super::sync::{
@@ -153,6 +154,8 @@ async fn pull_index_chunks<I: IndexFile>(
     index: I,
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
     backend: &DatastoreBackend,
+    archive_prefix: &str,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
     use futures::stream::{self, StreamExt, TryStreamExt};
 
@@ -247,11 +250,16 @@ async fn pull_index_chunks<I: IndexFile>(
     let bytes = bytes.load(Ordering::SeqCst);
     let chunk_count = chunk_count.load(Ordering::SeqCst);
 
-    info!(
-        "downloaded {} ({}/s)",
-        HumanByte::from(bytes),
-        HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
-    );
+    log_sender
+        .log(
+            Level::INFO,
+            format!(
+                "{archive_prefix}: downloaded {} ({}/s)",
+                HumanByte::from(bytes),
+                HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
+            ),
+        )
+        .await?;
 
     Ok(SyncStats {
         chunk_count,
@@ -292,6 +300,7 @@ async fn pull_single_archive<'a>(
     archive_info: &'a FileInfo,
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
     backend: &DatastoreBackend,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
     let archive_name = &archive_info.filename;
     let mut path = snapshot.full_path();
@@ -302,72 +311,104 @@ async fn pull_single_archive<'a>(
 
     let mut sync_stats = SyncStats::default();
 
-    info!("sync archive {archive_name}");
+    let archive_prefix = format!("{}: {archive_name}", snapshot.backup_time_string());
+
+    log_sender
+        .log(Level::INFO, format!("{archive_prefix}: sync archive"))
+        .await?;
 
-    reader.load_file_into(archive_name, &tmp_path).await?;
+    reader
+        .load_file_into(archive_name, &tmp_path)
+        .await
+        .with_context(|| archive_prefix.clone())?;
 
-    let mut tmpfile = std::fs::OpenOptions::new().read(true).open(&tmp_path)?;
+    let mut tmpfile = std::fs::OpenOptions::new()
+        .read(true)
+        .open(&tmp_path)
+        .with_context(|| archive_prefix.clone())?;
 
     match ArchiveType::from_path(archive_name)? {
         ArchiveType::DynamicIndex => {
             let index = DynamicIndexReader::new(tmpfile).map_err(|err| {
-                format_err!("unable to read dynamic index {:?} - {}", tmp_path, err)
+                format_err!("{archive_prefix}: unable to read dynamic index {tmp_path:?} - {err}")
             })?;
             let (csum, size) = index.compute_csum();
-            verify_archive(archive_info, &csum, size)?;
+            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
 
             if reader.skip_chunk_sync(snapshot.datastore().name()) {
-                info!("skipping chunk sync for same datastore");
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
+                    )
+                    .await?;
             } else {
                 let stats = pull_index_chunks(
                     reader
                         .chunk_reader(archive_info.crypt_mode)
-                        .context("failed to get chunk reader")?,
+                        .context("failed to get chunk reader")
+                        .with_context(|| archive_prefix.clone())?,
                     snapshot.datastore().clone(),
                     index,
                     encountered_chunks,
                     backend,
+                    &archive_prefix,
+                    Arc::clone(&log_sender),
                 )
-                .await?;
+                .await
+                .with_context(|| archive_prefix.clone())?;
                 sync_stats.add(stats);
             }
         }
         ArchiveType::FixedIndex => {
             let index = FixedIndexReader::new(tmpfile).map_err(|err| {
-                format_err!("unable to read fixed index '{:?}' - {}", tmp_path, err)
+                format_err!("{archive_prefix}: unable to read fixed index '{tmp_path:?}' - {err}")
             })?;
             let (csum, size) = index.compute_csum();
-            verify_archive(archive_info, &csum, size)?;
+            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
 
             if reader.skip_chunk_sync(snapshot.datastore().name()) {
-                info!("skipping chunk sync for same datastore");
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
+                    )
+                    .await?;
             } else {
                 let stats = pull_index_chunks(
                     reader
                         .chunk_reader(archive_info.crypt_mode)
-                        .context("failed to get chunk reader")?,
+                        .context("failed to get chunk reader")
+                        .with_context(|| archive_prefix.clone())?,
                     snapshot.datastore().clone(),
                     index,
                     encountered_chunks,
                     backend,
+                    &archive_prefix,
+                    Arc::clone(&log_sender),
                 )
-                .await?;
+                .await
+                .with_context(|| archive_prefix.clone())?;
                 sync_stats.add(stats);
             }
         }
         ArchiveType::Blob => {
-            tmpfile.rewind()?;
-            let (csum, size) = sha256(&mut tmpfile)?;
-            verify_archive(archive_info, &csum, size)?;
+            proxmox_lang::try_block!({
+                tmpfile.rewind()?;
+                let (csum, size) = sha256(&mut tmpfile)?;
+                verify_archive(archive_info, &csum, size)
+            })
+            .with_context(|| archive_prefix.clone())?;
         }
     }
     if let Err(err) = std::fs::rename(&tmp_path, &path) {
-        bail!("Atomic rename file {:?} failed - {}", path, err);
+        bail!("{archive_prefix}: Atomic rename file {path:?} failed - {err}");
     }
 
     backend
         .upload_index_to_backend(snapshot, archive_name)
-        .await?;
+        .await
+        .with_context(|| archive_prefix.clone())?;
 
     Ok(sync_stats)
 }
@@ -388,13 +429,24 @@ async fn pull_snapshot<'a>(
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
     corrupt: bool,
     is_new: bool,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
+    let prefix = snapshot.backup_time_string().to_owned();
     if is_new {
-        info!("sync snapshot {}", snapshot.dir());
+        log_sender
+            .log(Level::INFO, format!("{prefix}: start sync"))
+            .await?;
     } else if corrupt {
-        info!("re-sync snapshot {} due to corruption", snapshot.dir());
+        log_sender
+            .log(
+                Level::INFO,
+                format!("re-sync snapshot {prefix} due to corruption"),
+            )
+            .await?;
     } else {
-        info!("re-sync snapshot {}", snapshot.dir());
+        log_sender
+            .log(Level::INFO, format!("re-sync snapshot {prefix}"))
+            .await?;
     }
 
     let mut sync_stats = SyncStats::default();
@@ -409,7 +461,8 @@ async fn pull_snapshot<'a>(
     let tmp_manifest_blob;
     if let Some(data) = reader
         .load_file_into(MANIFEST_BLOB_NAME.as_ref(), &tmp_manifest_name)
-        .await?
+        .await
+        .with_context(|| prefix.clone())?
     {
         tmp_manifest_blob = data;
     } else {
@@ -419,28 +472,34 @@ async fn pull_snapshot<'a>(
     if manifest_name.exists() && !corrupt {
         let manifest_blob = proxmox_lang::try_block!({
             let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
-                format_err!("unable to open local manifest {manifest_name:?} - {err}")
+                format_err!("{prefix}: unable to open local manifest {manifest_name:?} - {err}")
             })?;
 
-            let manifest_blob = DataBlob::load_from_reader(&mut manifest_file)?;
+            let manifest_blob =
+                DataBlob::load_from_reader(&mut manifest_file).with_context(|| prefix.clone())?;
             Ok(manifest_blob)
         })
         .map_err(|err: Error| {
-            format_err!("unable to read local manifest {manifest_name:?} - {err}")
+            format_err!("{prefix}: unable to read local manifest {manifest_name:?} - {err}")
         })?;
 
         if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
             if !client_log_name.exists() {
-                reader.try_download_client_log(&client_log_name).await?;
+                reader
+                    .try_download_client_log(&client_log_name)
+                    .await
+                    .with_context(|| prefix.clone())?;
             };
-            info!("no data changes");
+            log_sender
+                .log(Level::INFO, format!("{prefix}: no data changes"))
+                .await?;
             let _ = std::fs::remove_file(&tmp_manifest_name);
             return Ok(sync_stats); // nothing changed
         }
     }
 
     let manifest_data = tmp_manifest_blob.raw_data().to_vec();
-    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
+    let manifest = BackupManifest::try_from(tmp_manifest_blob).with_context(|| prefix.clone())?;
 
     if ignore_not_verified_or_encrypted(
         &manifest,
@@ -464,35 +523,54 @@ async fn pull_snapshot<'a>(
         path.push(&item.filename);
 
         if !corrupt && path.exists() {
-            let filename: BackupArchiveName = item.filename.as_str().try_into()?;
+            let filename: BackupArchiveName = item
+                .filename
+                .as_str()
+                .try_into()
+                .with_context(|| prefix.clone())?;
             match filename.archive_type() {
                 ArchiveType::DynamicIndex => {
-                    let index = DynamicIndexReader::open(&path)?;
+                    let index = DynamicIndexReader::open(&path).with_context(|| prefix.clone())?;
                     let (csum, size) = index.compute_csum();
                     match manifest.verify_file(&filename, &csum, size) {
                         Ok(_) => continue,
                         Err(err) => {
-                            info!("detected changed file {path:?} - {err}");
+                            log_sender
+                                .log(
+                                    Level::INFO,
+                                    format!("{prefix}: detected changed file {path:?} - {err}"),
+                                )
+                                .await?;
                         }
                     }
                 }
                 ArchiveType::FixedIndex => {
-                    let index = FixedIndexReader::open(&path)?;
+                    let index = FixedIndexReader::open(&path).with_context(|| prefix.clone())?;
                     let (csum, size) = index.compute_csum();
                     match manifest.verify_file(&filename, &csum, size) {
                         Ok(_) => continue,
                         Err(err) => {
-                            info!("detected changed file {path:?} - {err}");
+                            log_sender
+                                .log(
+                                    Level::INFO,
+                                    format!("{prefix}: detected changed file {path:?} - {err}"),
+                                )
+                                .await?;
                         }
                     }
                 }
                 ArchiveType::Blob => {
-                    let mut tmpfile = std::fs::File::open(&path)?;
-                    let (csum, size) = sha256(&mut tmpfile)?;
+                    let mut tmpfile = std::fs::File::open(&path).with_context(|| prefix.clone())?;
+                    let (csum, size) = sha256(&mut tmpfile).with_context(|| prefix.clone())?;
                     match manifest.verify_file(&filename, &csum, size) {
                         Ok(_) => continue,
                         Err(err) => {
-                            info!("detected changed file {path:?} - {err}");
+                            log_sender
+                                .log(
+                                    Level::INFO,
+                                    format!("{prefix}: detected changed file {path:?} - {err}"),
+                                )
+                                .await?;
                         }
                     }
                 }
@@ -505,13 +583,14 @@ async fn pull_snapshot<'a>(
             item,
             encountered_chunks.clone(),
             backend,
+            Arc::clone(&log_sender),
         )
         .await?;
         sync_stats.add(stats);
     }
 
     if let Err(err) = std::fs::rename(&tmp_manifest_name, &manifest_name) {
-        bail!("Atomic rename file {:?} failed - {}", manifest_name, err);
+        bail!("{prefix}: Atomic rename file {manifest_name:?} failed - {err}");
     }
     if let DatastoreBackend::S3(s3_client) = backend {
         let object_key = pbs_datastore::s3::object_key_from_path(
@@ -524,33 +603,40 @@ async fn pull_snapshot<'a>(
         let _is_duplicate = s3_client
             .upload_replace_with_retry(object_key, data)
             .await
-            .context("failed to upload manifest to s3 backend")?;
+            .context("failed to upload manifest to s3 backend")
+            .with_context(|| prefix.clone())?;
     }
 
     if !client_log_name.exists() {
-        reader.try_download_client_log(&client_log_name).await?;
+        reader
+            .try_download_client_log(&client_log_name)
+            .await
+            .with_context(|| prefix.clone())?;
         if client_log_name.exists() {
             if let DatastoreBackend::S3(s3_client) = backend {
                 let object_key = pbs_datastore::s3::object_key_from_path(
                     &snapshot.relative_path(),
                     CLIENT_LOG_BLOB_NAME.as_ref(),
                 )
-                .context("invalid archive object key")?;
+                .context("invalid archive object key")
+                .with_context(|| prefix.clone())?;
 
                 let data = tokio::fs::read(&client_log_name)
                     .await
-                    .context("failed to read log file contents")?;
+                    .context("failed to read log file contents")
+                    .with_context(|| prefix.clone())?;
                 let contents = hyper::body::Bytes::from(data);
                 let _is_duplicate = s3_client
                     .upload_replace_with_retry(object_key, contents)
                     .await
-                    .context("failed to upload client log to s3 backend")?;
+                    .context("failed to upload client log to s3 backend")
+                    .with_context(|| prefix.clone())?;
             }
         }
     };
     snapshot
         .cleanup_unreferenced_files(&manifest)
-        .map_err(|err| format_err!("failed to cleanup unreferenced files - {err}"))?;
+        .map_err(|err| format_err!("{prefix}: failed to cleanup unreferenced files - {err}"))?;
 
     Ok(sync_stats)
 }
@@ -565,10 +651,14 @@ async fn pull_snapshot_from<'a>(
     snapshot: &'a pbs_datastore::BackupDir,
     encountered_chunks: Arc<Mutex<EncounteredChunks>>,
     corrupt: bool,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
+    let prefix = snapshot.backup_time_string().to_owned();
+
     let (_path, is_new, _snap_lock) = snapshot
         .datastore()
-        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())?;
+        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())
+        .context(prefix.clone())?;
 
     let result = pull_snapshot(
         params,
@@ -577,6 +667,7 @@ async fn pull_snapshot_from<'a>(
         encountered_chunks,
         corrupt,
         is_new,
+        Arc::clone(&log_sender),
     )
     .await;
 
@@ -589,11 +680,20 @@ async fn pull_snapshot_from<'a>(
                     snapshot.as_ref(),
                     true,
                 ) {
-                    info!("cleanup error - {cleanup_err}");
+                    log_sender
+                        .log(
+                            Level::INFO,
+                            format!("{prefix}: cleanup error - {cleanup_err}"),
+                        )
+                        .await?;
                 }
                 return Err(err);
             }
-            Ok(_) => info!("sync snapshot {} done", snapshot.dir()),
+            Ok(_) => {
+                log_sender
+                    .log(Level::INFO, format!("{prefix}: sync done"))
+                    .await?
+            }
         }
     }
 
@@ -622,7 +722,9 @@ async fn pull_group(
     source_namespace: &BackupNamespace,
     group: &BackupGroup,
     shared_group_progress: Arc<SharedGroupProgress>,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
+    let prefix = group.to_string();
     let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
     let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
 
@@ -714,11 +816,15 @@ async fn pull_group(
         .collect();
 
     if already_synced_skip_info.count > 0 {
-        info!("{already_synced_skip_info}");
+        log_sender
+            .log(Level::INFO, format!("{prefix}: {already_synced_skip_info}"))
+            .await?;
         already_synced_skip_info.reset();
     }
     if transfer_last_skip_info.count > 0 {
-        info!("{transfer_last_skip_info}");
+        log_sender
+            .log(Level::INFO, format!("{prefix}: {transfer_last_skip_info}"))
+            .await?;
         transfer_last_skip_info.reset();
     }
 
@@ -730,8 +836,8 @@ async fn pull_group(
         .store
         .backup_group(target_ns.clone(), group.clone());
     if let Some(info) = backup_group.last_backup(true).unwrap_or(None) {
-        let mut reusable_chunks = encountered_chunks.lock().unwrap();
         if let Err(err) = proxmox_lang::try_block!({
+            let mut reusable_chunks = encountered_chunks.lock().unwrap();
             let _snapshot_guard = info
                 .backup_dir
                 .lock_shared()
@@ -780,7 +886,12 @@ async fn pull_group(
             }
             Ok::<(), Error>(())
         }) {
-            warn!("Failed to collect reusable chunk from last backup: {err:#?}");
+            log_sender
+                .log(
+                    Level::WARN,
+                    format!("Failed to collect reusable chunk from last backup: {err:#?}"),
+                )
+                .await?;
         }
     }
 
@@ -805,13 +916,16 @@ async fn pull_group(
             &to_snapshot,
             encountered_chunks.clone(),
             corrupt,
+            Arc::clone(&log_sender),
         )
         .await;
 
         // Update done groups progress by other parallel running pulls
         local_progress.done_groups = shared_group_progress.load_done();
         local_progress.done_snapshots = pos as u64 + 1;
-        info!("percentage done: group {group}: {local_progress}");
+        log_sender
+            .log(Level::INFO, format!("percentage done: {local_progress}"))
+            .await?;
 
         let stats = result?; // stop on error
         sync_stats.add(stats);
@@ -829,13 +943,23 @@ async fn pull_group(
                 continue;
             }
             if snapshot.is_protected() {
-                info!(
-                    "don't delete vanished snapshot {} (protected)",
-                    snapshot.dir()
-                );
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!(
+                            "{prefix}: don't delete vanished snapshot {} (protected)",
+                            snapshot.dir(),
+                        ),
+                    )
+                    .await?;
                 continue;
             }
-            info!("delete vanished snapshot {}", snapshot.dir());
+            log_sender
+                .log(
+                    Level::INFO,
+                    format!("{prefix}: delete vanished snapshot {}", snapshot.dir()),
+                )
+                .await?;
             params
                 .target
                 .store
@@ -1035,10 +1159,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
             }
             Err(err) => {
                 errors = true;
-                info!(
-                    "Encountered errors while syncing namespace {} - {err}",
-                    &namespace,
-                );
+                info!("Encountered errors while syncing namespace {namespace} - {err}");
             }
         };
     }
@@ -1064,6 +1185,7 @@ async fn lock_and_pull_group(
     namespace: &BackupNamespace,
     target_namespace: &BackupNamespace,
     shared_group_progress: Arc<SharedGroupProgress>,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
     let (owner, _lock_guard) =
         match params
@@ -1073,25 +1195,47 @@ async fn lock_and_pull_group(
         {
             Ok(res) => res,
             Err(err) => {
-                info!("sync group {group} failed - group lock failed: {err}");
-                info!("create_locked_backup_group failed");
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!("sync group {group} failed - group lock failed: {err}"),
+                    )
+                    .await?;
+                log_sender
+                    .log(Level::INFO, "create_locked_backup_group failed".to_string())
+                    .await?;
                 return Err(err);
             }
         };
 
     if params.owner != owner {
         // only the owner is allowed to create additional snapshots
-        info!(
-            "sync group {group} failed - owner check failed ({} != {owner})",
-            params.owner
-        );
+        log_sender
+            .log(
+                Level::INFO,
+                format!(
+                    "sync group {group} failed - owner check failed ({} != {owner})",
+                    params.owner,
+                ),
+            )
+            .await?;
         return Err(format_err!("owner check failed"));
     }
 
-    match pull_group(params, namespace, group, shared_group_progress).await {
+    match pull_group(
+        params,
+        namespace,
+        group,
+        shared_group_progress,
+        Arc::clone(&log_sender),
+    )
+    .await
+    {
         Ok(stats) => Ok(stats),
         Err(err) => {
-            info!("sync group {group} failed - {err:#}");
+            log_sender
+                .log(Level::INFO, format!("sync group {group} failed - {err:#}"))
+                .await?;
             Err(err)
         }
     }
@@ -1124,7 +1268,7 @@ async fn pull_ns(
     list.sort_unstable();
 
     info!(
-        "found {} groups to sync (out of {unfiltered_count} total)",
+        "Found {} groups to sync (out of {unfiltered_count} total)",
         list.len()
     );
 
@@ -1143,6 +1287,10 @@ async fn pull_ns(
     let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
     let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
 
+    let (buffered_logger, sender_builder) = BufferedLogger::new(5, Duration::from_secs(1));
+    // runs until sender_builder and all senders build from it are being dropped
+    buffered_logger.run_log_collection();
+
     let mut process_results = |results| {
         for result in results {
             match result {
@@ -1160,16 +1308,20 @@ async fn pull_ns(
         let target_ns = target_ns.clone();
         let params = Arc::clone(&params);
         let group_progress_cloned = Arc::clone(&shared_group_progress);
+        let log_sender = Arc::new(sender_builder.sender_with_label(group.to_string()));
         let results = group_workers
             .spawn_task(async move {
-                lock_and_pull_group(
+                let result = lock_and_pull_group(
                     Arc::clone(&params),
                     &group,
                     &namespace,
                     &target_ns,
                     group_progress_cloned,
+                    Arc::clone(&log_sender),
                 )
-                .await
+                .await;
+                let _ = log_sender.flush().await;
+                result
             })
             .await
             .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
@@ -1197,7 +1349,7 @@ async fn pull_ns(
                 if !local_group.apply_filters(&params.group_filter) {
                     continue;
                 }
-                info!("delete vanished group '{local_group}'");
+                info!("Delete vanished group '{local_group}'");
                 let delete_stats_result = params
                     .target
                     .store
@@ -1206,7 +1358,7 @@ async fn pull_ns(
                 match delete_stats_result {
                     Ok(stats) => {
                         if !stats.all_removed() {
-                            info!("kept some protected snapshots of group '{local_group}'");
+                            info!("Kept some protected snapshots of group '{local_group}'");
                             sync_stats.add(SyncStats::from(RemovedVanishedStats {
                                 snapshots: stats.removed_snapshots(),
                                 groups: 0,
@@ -1229,7 +1381,7 @@ async fn pull_ns(
             Ok(())
         });
         if let Err(err) = result {
-            info!("error during cleanup: {err}");
+            info!("Error during cleanup: {err}");
             errors = true;
         };
     }
diff --git a/src/server/sync.rs b/src/server/sync.rs
index e88418442..17ed4839f 100644
--- a/src/server/sync.rs
+++ b/src/server/sync.rs
@@ -13,7 +13,6 @@ use futures::{future::FutureExt, select};
 use hyper::http::StatusCode;
 use pbs_config::BackupLockGuard;
 use serde_json::json;
-use tokio::task::JoinSet;
 use tracing::{info, warn};
 
 use proxmox_human_byte::HumanByte;
@@ -136,13 +135,13 @@ impl SyncSourceReader for RemoteSourceReader {
                 Some(HttpError { code, message }) => match *code {
                     StatusCode::NOT_FOUND => {
                         info!(
-                            "skipping snapshot {} - vanished since start of sync",
+                            "Snapshot {}: skipped because it vanished since start of sync",
                             &self.dir
                         );
                         return Ok(None);
                     }
                     _ => {
-                        bail!("HTTP error {code} - {message}");
+                        bail!("Snapshot {}: HTTP error {code} - {message}", &self.dir);
                     }
                 },
                 None => {
@@ -176,7 +175,8 @@ impl SyncSourceReader for RemoteSourceReader {
                 bail!("Atomic rename file {to_path:?} failed - {err}");
             }
             info!(
-                "got backup log file {client_log_name}",
+                "Snapshot {snapshot}: got backup log file {client_log_name}",
+                snapshot = &self.dir,
                 client_log_name = client_log_name.deref()
             );
         }
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 12/15] sync: push: prepare push parameters to be shared across parallel tasks
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (10 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 13/15] server: sync: allow pushing groups concurrently Christian Ebner
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

When syncing groups in parallel, the push parameters must be shared
between all tasks, which is not possible with plain references
because of lifetime and ownership constraints. Wrap them in an
atomic reference counter (`Arc`) instead, so they can be cheaply
cloned wherever required.
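
The ownership issue described above can be sketched with plain std
threads (tokio tasks impose the same `'static` bound on spawned
futures); the `PushParameters` struct and `push_all` helper below are
hypothetical stand-ins for illustration, not the real types:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the real PushParameters: read-only job
// state that every group worker needs access to.
struct PushParameters {
    store: String,
    worker_threads: usize,
}

// Spawn one worker per group. Each Arc::clone is a cheap
// reference-count bump, and the moved clone satisfies the 'static
// bound on thread::spawn without borrowing from the caller's stack.
fn push_all(params: Arc<PushParameters>) -> Vec<String> {
    let handles: Vec<_> = (0..params.worker_threads)
        .map(|group| {
            let params = Arc::clone(&params);
            thread::spawn(move || format!("group {group}: pushed to {}", params.store))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let params = Arc::new(PushParameters {
        store: "push-target-store".to_string(),
        worker_threads: 4,
    });
    let logs = push_all(params);
    assert_eq!(logs.len(), 4);
    assert!(logs.iter().all(|l| l.contains("push-target-store")));
}
```

With a plain `&PushParameters` the closure passed to the spawn call
would borrow from the enclosing stack frame and fail to compile,
which is why the patch converts the call chain to take `Arc<PushParameters>`.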

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 src/server/push.rs | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/server/push.rs b/src/server/push.rs
index 44a204e6b..14395fe61 100644
--- a/src/server/push.rs
+++ b/src/server/push.rs
@@ -405,6 +405,7 @@ pub(crate) async fn push_store(mut params: PushParameters) -> Result<SyncStats,
 
     let (mut groups, mut snapshots) = (0, 0);
     let mut stats = SyncStats::default();
+    let params = Arc::new(params);
     for source_namespace in &source_namespaces {
         let source_store_and_ns = print_store_and_ns(params.source.store.name(), source_namespace);
         let target_namespace = params.map_to_target(source_namespace)?;
@@ -428,7 +429,7 @@ pub(crate) async fn push_store(mut params: PushParameters) -> Result<SyncStats,
             continue;
         }
 
-        match push_namespace(source_namespace, &params).await {
+        match push_namespace(source_namespace, Arc::clone(&params)).await {
             Ok((sync_progress, sync_stats, sync_errors)) => {
                 errors |= sync_errors;
                 stats.add(sync_stats);
@@ -523,11 +524,11 @@ pub(crate) async fn push_store(mut params: PushParameters) -> Result<SyncStats,
 /// Iterate over all backup groups in the namespace and push them to the target.
 pub(crate) async fn push_namespace(
     namespace: &BackupNamespace,
-    params: &PushParameters,
+    params: Arc<PushParameters>,
 ) -> Result<(StoreProgress, SyncStats, bool), Error> {
     let target_namespace = params.map_to_target(namespace)?;
     // Check if user is allowed to perform backups on remote datastore
-    check_ns_remote_datastore_privs(params, &target_namespace, PRIV_REMOTE_DATASTORE_BACKUP)
+    check_ns_remote_datastore_privs(&params, &target_namespace, PRIV_REMOTE_DATASTORE_BACKUP)
         .context("Pushing to remote namespace not allowed")?;
 
     let mut list: Vec<BackupGroup> = params
@@ -555,7 +556,7 @@ pub(crate) async fn push_namespace(
     let mut stats = SyncStats::default();
 
     let (owned_target_groups, not_owned_target_groups) =
-        fetch_target_groups(params, &target_namespace).await?;
+        fetch_target_groups(&params, &target_namespace).await?;
 
     for (done, group) in list.into_iter().enumerate() {
         progress.done_groups = done as u64;
@@ -571,7 +572,7 @@ pub(crate) async fn push_namespace(
         }
         synced_groups.insert(group.clone());
 
-        match push_group(params, namespace, &group, &mut progress).await {
+        match push_group(Arc::clone(&params), namespace, &group, &mut progress).await {
             Ok(sync_stats) => stats.add(sync_stats),
             Err(err) => {
                 warn!("Encountered errors: {err:#}");
@@ -591,7 +592,7 @@ pub(crate) async fn push_namespace(
                 continue;
             }
 
-            match remove_target_group(params, &target_namespace, &target_group).await {
+            match remove_target_group(&params, &target_namespace, &target_group).await {
                 Ok(delete_stats) => {
                     info!("Removed vanished group {target_group} from remote");
                     if delete_stats.protected_snapshots() > 0 {
@@ -673,7 +674,7 @@ async fn forget_target_snapshot(
 /// - Iterate the snapshot list and push each snapshot individually
 /// - (Optional): Remove vanished groups if `remove_vanished` flag is set
 pub(crate) async fn push_group(
-    params: &PushParameters,
+    params: Arc<PushParameters>,
     namespace: &BackupNamespace,
     group: &BackupGroup,
     progress: &mut StoreProgress,
@@ -692,7 +693,7 @@ pub(crate) async fn push_group(
     }
 
     let target_namespace = params.map_to_target(namespace)?;
-    let mut target_snapshots = fetch_target_snapshots(params, &target_namespace, group).await?;
+    let mut target_snapshots = fetch_target_snapshots(&params, &target_namespace, group).await?;
     target_snapshots.sort_unstable_by_key(|a| a.backup.time);
 
     let last_snapshot_time = target_snapshots
@@ -749,8 +750,13 @@ pub(crate) async fn push_group(
     let mut stats = SyncStats::default();
     let mut fetch_previous_manifest = !target_snapshots.is_empty();
     for (pos, source_snapshot) in snapshots.into_iter().enumerate() {
-        let result =
-            push_snapshot(params, namespace, &source_snapshot, fetch_previous_manifest).await;
+        let result = push_snapshot(
+            &params,
+            namespace,
+            &source_snapshot,
+            fetch_previous_manifest,
+        )
+        .await;
         fetch_previous_manifest = true;
 
         progress.done_snapshots = pos as u64 + 1;
@@ -773,7 +779,7 @@ pub(crate) async fn push_group(
                 );
                 continue;
             }
-            match forget_target_snapshot(params, &target_namespace, &snapshot.backup).await {
+            match forget_target_snapshot(&params, &target_namespace, &snapshot.backup).await {
                 Ok(()) => {
                     info!(
                         "Removed vanished snapshot {name} from remote",
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 13/15] server: sync: allow pushing groups concurrently
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (11 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 12/15] sync: push: prepare push parameters to be shared across parallel tasks Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 14/15] server: push: prefix log messages and add additional logging Christian Ebner
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Improve throughput over high latency connections for sync jobs in
push direction by pushing up to a configured number of backup groups
concurrently. Just like for pull sync jobs, use a bounded join set
to run up to the configured number of group worker tokio tasks in
parallel, each connecting to and pushing a group to the remote
target.

The store progress and sync group housekeeping are placed behind an
atomic reference counted mutex to allow concurrent status updates.
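The bounded-worker pattern described above can be illustrated with a
minimal std-only stand-in (the series itself uses an async, tokio-based
`BoundedJoinSet`; `run_bounded` and the permit channel here are
hypothetical simplifications): at most `limit` jobs run at once, and a
permit is recycled through a channel whenever a worker finishes.

```rust
use std::sync::mpsc;
use std::thread;

// Run all jobs, but never more than `limit` at the same time.
// Permits circulate through a channel: a job starts only after
// receiving one, and returns it when done.
fn run_bounded<T: Send + 'static>(
    limit: usize,
    jobs: Vec<Box<dyn FnOnce() -> T + Send>>,
) -> Vec<T> {
    let (permit_tx, permit_rx) = mpsc::channel();
    for _ in 0..limit {
        permit_tx.send(()).unwrap(); // hand out the initial permits
    }
    let (result_tx, result_rx) = mpsc::channel();
    let count = jobs.len();
    for job in jobs {
        permit_rx.recv().unwrap(); // block until a worker slot is free
        let permit_tx = permit_tx.clone();
        let result_tx = result_tx.clone();
        thread::spawn(move || {
            let out = job();
            result_tx.send(out).unwrap();
            permit_tx.send(()).unwrap(); // give the permit back
        });
    }
    result_rx.iter().take(count).collect()
}

fn main() {
    let mut jobs: Vec<Box<dyn FnOnce() -> u32 + Send>> = Vec::new();
    for i in 0..4u32 {
        jobs.push(Box::new(move || i * 10));
    }
    let mut results = run_bounded(2, jobs);
    results.sort(); // completion order is nondeterministic
    println!("{results:?}");
}
```

The tokio variant additionally needs to join finished tasks and surface
their errors, which is what the `spawn_task`/`join_active` calls in the
diff below take care of.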

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- uses BoundedJoinSet implementation, refactored accordingly

 src/server/push.rs | 102 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 25 deletions(-)

diff --git a/src/server/push.rs b/src/server/push.rs
index 14395fe61..9b7fb4522 100644
--- a/src/server/push.rs
+++ b/src/server/push.rs
@@ -27,6 +27,7 @@ use pbs_datastore::fixed_index::FixedIndexReader;
 use pbs_datastore::index::IndexFile;
 use pbs_datastore::read_chunk::AsyncReadChunk;
 use pbs_datastore::{DataStore, StoreProgress};
+use pbs_tools::bounded_join_set::BoundedJoinSet;
 
 use super::sync::{
     check_namespace_depth_limit, exclude_not_verified_or_encrypted,
@@ -34,6 +35,7 @@ use super::sync::{
     SyncSource, SyncStats,
 };
 use crate::api2::config::remote;
+use crate::server::sync::SharedGroupProgress;
 
 /// Target for backups to be pushed to
 pub(crate) struct PushTarget {
@@ -551,41 +553,62 @@ pub(crate) async fn push_namespace(
 
     let mut errors = false;
     // Remember synced groups, remove others when the remove vanished flag is set
-    let mut synced_groups = HashSet::new();
+    let synced_groups = Arc::new(Mutex::new(HashSet::new()));
     let mut progress = StoreProgress::new(list.len() as u64);
     let mut stats = SyncStats::default();
 
     let (owned_target_groups, not_owned_target_groups) =
         fetch_target_groups(&params, &target_namespace).await?;
+    let not_owned_target_groups = Arc::new(not_owned_target_groups);
 
-    for (done, group) in list.into_iter().enumerate() {
-        progress.done_groups = done as u64;
-        progress.done_snapshots = 0;
-        progress.group_snapshots = 0;
+    let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
+    let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
 
-        if not_owned_target_groups.contains(&group) {
-            warn!(
-                "Group '{group}' not owned by remote user '{}' on target, skipping upload",
-                params.target.remote_user(),
-            );
-            continue;
-        }
-        synced_groups.insert(group.clone());
-
-        match push_group(Arc::clone(&params), namespace, &group, &mut progress).await {
-            Ok(sync_stats) => stats.add(sync_stats),
-            Err(err) => {
-                warn!("Encountered errors: {err:#}");
-                warn!("Failed to push group {group} to remote!");
-                errors = true;
+    let mut process_results = |results| {
+        for result in results {
+            match result {
+                Ok(sync_stats) => {
+                    stats.add(sync_stats);
+                    progress.done_groups = shared_group_progress.increment_done();
+                }
+                Err(()) => errors = true,
             }
         }
+    };
+
+    for group in list.into_iter() {
+        let namespace = namespace.clone();
+        let params = Arc::clone(&params);
+        let not_owned_target_groups = Arc::clone(&not_owned_target_groups);
+        let synced_groups = Arc::clone(&synced_groups);
+        let group_progress_cloned = Arc::clone(&shared_group_progress);
+        let results = group_workers
+            .spawn_task(async move {
+                push_group_do(
+                    params,
+                    &namespace,
+                    &group,
+                    group_progress_cloned,
+                    synced_groups,
+                    not_owned_target_groups,
+                )
+                .await
+            })
+            .await
+            .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
+        process_results(results);
     }
 
+    let results = group_workers
+        .join_active()
+        .await
+        .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
+    process_results(results);
+
     if params.remove_vanished {
         // only ever allow to prune owned groups on target
         for target_group in owned_target_groups {
-            if synced_groups.contains(&target_group) {
+            if synced_groups.lock().unwrap().contains(&target_group) {
                 continue;
             }
             if !target_group.apply_filters(&params.group_filter) {
@@ -664,6 +687,32 @@ async fn forget_target_snapshot(
     Ok(())
 }
 
+async fn push_group_do(
+    params: Arc<PushParameters>,
+    namespace: &BackupNamespace,
+    group: &BackupGroup,
+    shared_group_progress: Arc<SharedGroupProgress>,
+    synced_groups: Arc<Mutex<HashSet<BackupGroup>>>,
+    not_owned_target_groups: Arc<HashSet<BackupGroup>>,
+) -> Result<SyncStats, ()> {
+    if not_owned_target_groups.contains(group) {
+        warn!(
+            "Group '{group}' not owned by remote user '{}' on target, skipping upload",
+            params.target.remote_user(),
+        );
+        shared_group_progress.increment_done();
+        return Ok(SyncStats::default());
+    }
+
+    synced_groups.lock().unwrap().insert(group.clone());
+    push_group(params, namespace, group, Arc::clone(&shared_group_progress))
+        .await
+        .map_err(|err| {
+            warn!("Group {group}: Encountered errors: {err:#}");
+            warn!("Failed to push group {group} to remote!");
+        })
+}
+
 /// Push group including all snaphshots to target
 ///
 /// Iterate over all snapshots in the group and push them to the target.
@@ -677,7 +726,7 @@ pub(crate) async fn push_group(
     params: Arc<PushParameters>,
     namespace: &BackupNamespace,
     group: &BackupGroup,
-    progress: &mut StoreProgress,
+    shared_group_progress: Arc<SharedGroupProgress>,
 ) -> Result<SyncStats, Error> {
     let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
     let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
@@ -745,7 +794,8 @@ pub(crate) async fn push_group(
         transfer_last_skip_info.reset();
     }
 
-    progress.group_snapshots = snapshots.len() as u64;
+    let mut local_progress = StoreProgress::new(shared_group_progress.total_groups());
+    local_progress.group_snapshots = snapshots.len() as u64;
 
     let mut stats = SyncStats::default();
     let mut fetch_previous_manifest = !target_snapshots.is_empty();
@@ -759,8 +809,10 @@ pub(crate) async fn push_group(
         .await;
         fetch_previous_manifest = true;
 
-        progress.done_snapshots = pos as u64 + 1;
-        info!("Percentage done: {progress}");
+        // Update done groups progress by other parallel running pushes
+        local_progress.done_groups = shared_group_progress.load_done();
+        local_progress.done_snapshots = pos as u64 + 1;
+        info!("Percentage done: group {group}: {local_progress}");
 
         // stop on error
         let sync_stats = result?;
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 14/15] server: push: prefix log messages and add additional logging
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (12 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 13/15] server: sync: allow pushing groups concurrently Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 15/15] ui: expose group worker setting in sync job edit window Christian Ebner
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Pushing groups, and therefore also snapshots, in parallel leads to
unordered log output, making it mostly impossible to relate a log
message to a backup snapshot or group.

Therefore, prefix push job log messages with the corresponding group
or snapshot and use the buffered logger implementation to buffer up
to 5 subsequent lines with a timeout of 1 second. This reduces
interwoven log messages stemming from different groups.

Also, be more verbose for push syncs, adding additional log output
for the groups, snapshots and archives being pushed.
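The line-count part of the buffering scheme can be sketched
synchronously (a simplification: the real `BufferedLogger` is async
and channel-based, and additionally flushes on the 1 second timeout;
`LineBuffer` and its fields are hypothetical names): lines are
buffered per label and emitted as one block once a label accumulates
the configured number of lines, so messages from one group stay
together.

```rust
use std::collections::HashMap;

// Buffer log lines per label; flush a label's buffer as a contiguous
// block once it reaches `max_lines`. `output` stands in for the
// actual task log writer.
struct LineBuffer {
    max_lines: usize,
    buffers: HashMap<String, Vec<String>>,
    output: Vec<String>,
}

impl LineBuffer {
    fn new(max_lines: usize) -> Self {
        Self {
            max_lines,
            buffers: HashMap::new(),
            output: Vec::new(),
        }
    }

    fn log(&mut self, label: &str, line: &str) {
        let buf = self.buffers.entry(label.to_string()).or_default();
        buf.push(format!("{label}: {line}")); // prefix with the label
        if buf.len() >= self.max_lines {
            self.flush(label);
        }
    }

    fn flush(&mut self, label: &str) {
        if let Some(buf) = self.buffers.get_mut(label) {
            self.output.append(buf); // drains the label's buffer
        }
    }
}
```

With two interleaved labels and `max_lines = 2`, logging "A", "B", "A"
emits the two "A" lines back to back instead of interleaved with "B",
which is exactly the readability gain this patch is after.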

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- uses BufferedLogger implementation, refactored accordingly
- improve log line prefixes
- add missing error contexts

 src/server/push.rs | 245 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 199 insertions(+), 46 deletions(-)

diff --git a/src/server/push.rs b/src/server/push.rs
index 9b7fb4522..520bdd250 100644
--- a/src/server/push.rs
+++ b/src/server/push.rs
@@ -2,12 +2,13 @@
 
 use std::collections::HashSet;
 use std::sync::{Arc, Mutex};
+use std::time::Duration;
 
 use anyhow::{bail, format_err, Context, Error};
 use futures::stream::{self, StreamExt, TryStreamExt};
 use tokio::sync::mpsc;
 use tokio_stream::wrappers::ReceiverStream;
-use tracing::{info, warn};
+use tracing::{info, warn, Level};
 
 use pbs_api_types::{
     print_store_and_ns, ApiVersion, ApiVersionInfo, ArchiveType, Authid, BackupArchiveName,
@@ -28,6 +29,9 @@ use pbs_datastore::index::IndexFile;
 use pbs_datastore::read_chunk::AsyncReadChunk;
 use pbs_datastore::{DataStore, StoreProgress};
 use pbs_tools::bounded_join_set::BoundedJoinSet;
+use pbs_tools::buffered_logger::{BufferedLogger, LogLineSender};
+
+use proxmox_human_byte::HumanByte;
 
 use super::sync::{
     check_namespace_depth_limit, exclude_not_verified_or_encrypted,
@@ -564,6 +568,10 @@ pub(crate) async fn push_namespace(
     let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
     let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
 
+    let (buffered_logger, sender_builder) = BufferedLogger::new(5, Duration::from_secs(1));
+    // runs until sender_builder and all senders build from it are being dropped
+    buffered_logger.run_log_collection();
+
     let mut process_results = |results| {
         for result in results {
             match result {
@@ -571,7 +579,7 @@ pub(crate) async fn push_namespace(
                     stats.add(sync_stats);
                     progress.done_groups = shared_group_progress.increment_done();
                 }
-                Err(()) => errors = true,
+                Err(_err) => errors = true,
             }
         }
     };
@@ -582,17 +590,21 @@ pub(crate) async fn push_namespace(
         let not_owned_target_groups = Arc::clone(&not_owned_target_groups);
         let synced_groups = Arc::clone(&synced_groups);
         let group_progress_cloned = Arc::clone(&shared_group_progress);
+        let log_sender = Arc::new(sender_builder.sender_with_label(group.to_string()));
         let results = group_workers
             .spawn_task(async move {
-                push_group_do(
+                let result = push_group_do(
                     params,
                     &namespace,
                     &group,
                     group_progress_cloned,
                     synced_groups,
                     not_owned_target_groups,
+                    Arc::clone(&log_sender),
                 )
-                .await
+                .await;
+                let _ = log_sender.flush().await;
+                result
             })
             .await
             .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
@@ -694,23 +706,46 @@ async fn push_group_do(
     shared_group_progress: Arc<SharedGroupProgress>,
     synced_groups: Arc<Mutex<HashSet<BackupGroup>>>,
     not_owned_target_groups: Arc<HashSet<BackupGroup>>,
-) -> Result<SyncStats, ()> {
+    log_sender: Arc<LogLineSender>,
+) -> Result<SyncStats, Error> {
     if not_owned_target_groups.contains(group) {
-        warn!(
-            "Group '{group}' not owned by remote user '{}' on target, skipping upload",
-            params.target.remote_user(),
-        );
+        log_sender
+            .log(
+                Level::WARN,
+                format!(
+                    "Group '{group}' not owned by remote user '{}' on target, skipping upload",
+                    params.target.remote_user(),
+                ),
+            )
+            .await?;
         shared_group_progress.increment_done();
         return Ok(SyncStats::default());
     }
 
     synced_groups.lock().unwrap().insert(group.clone());
-    push_group(params, namespace, group, Arc::clone(&shared_group_progress))
-        .await
-        .map_err(|err| {
-            warn!("Group {group}: Encountered errors: {err:#}");
-            warn!("Failed to push group {group} to remote!");
-        })
+    match push_group(
+        params,
+        namespace,
+        group,
+        Arc::clone(&shared_group_progress),
+        Arc::clone(&log_sender),
+    )
+    .await
+    {
+        Ok(res) => Ok(res),
+        Err(err) => {
+            log_sender
+                .log(Level::WARN, format!("Encountered errors: {err:#}"))
+                .await?;
+            log_sender
+                .log(
+                    Level::WARN,
+                    format!("Failed to push group {group} to remote!"),
+                )
+                .await?;
+            Err(err)
+        }
+    }
 }
 
 /// Push group including all snaphshots to target
@@ -727,6 +762,7 @@ pub(crate) async fn push_group(
     namespace: &BackupNamespace,
     group: &BackupGroup,
     shared_group_progress: Arc<SharedGroupProgress>,
+    log_sender: Arc<LogLineSender>,
 ) -> Result<SyncStats, Error> {
     let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
     let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
@@ -738,7 +774,12 @@ pub(crate) async fn push_group(
     snapshots.sort_unstable_by_key(|a| a.backup.time);
 
     if snapshots.is_empty() {
-        info!("Group '{group}' contains no snapshots to sync to remote");
+        log_sender
+            .log(
+                Level::INFO,
+                format!("Group '{group}' contains no snapshots to sync to remote"),
+            )
+            .await?;
     }
 
     let target_namespace = params.map_to_target(namespace)?;
@@ -786,11 +827,15 @@ pub(crate) async fn push_group(
         .collect();
 
     if already_synced_skip_info.count > 0 {
-        info!("{already_synced_skip_info}");
+        log_sender
+            .log(Level::INFO, already_synced_skip_info.to_string())
+            .await?;
         already_synced_skip_info.reset();
     }
     if transfer_last_skip_info.count > 0 {
-        info!("{transfer_last_skip_info}");
+        log_sender
+            .log(Level::INFO, transfer_last_skip_info.to_string())
+            .await?;
         transfer_last_skip_info.reset();
     }
 
@@ -800,11 +845,18 @@ pub(crate) async fn push_group(
     let mut stats = SyncStats::default();
     let mut fetch_previous_manifest = !target_snapshots.is_empty();
     for (pos, source_snapshot) in snapshots.into_iter().enumerate() {
+        let prefix = proxmox_time::epoch_to_rfc3339_utc(source_snapshot.time)
+            .context("invalid timestamp")?;
+        log_sender
+            .log(Level::INFO, format!("{prefix}: start sync"))
+            .await?;
         let result = push_snapshot(
             &params,
             namespace,
             &source_snapshot,
             fetch_previous_manifest,
+            Arc::clone(&log_sender),
+            &prefix,
         )
         .await;
         fetch_previous_manifest = true;
@@ -812,10 +864,18 @@ pub(crate) async fn push_group(
         // Update done groups progress by other parallel running pushes
         local_progress.done_groups = shared_group_progress.load_done();
         local_progress.done_snapshots = pos as u64 + 1;
-        info!("Percentage done: group {group}: {local_progress}");
 
         // stop on error
         let sync_stats = result?;
+        log_sender
+            .log(Level::INFO, format!("{prefix}: sync done"))
+            .await?;
+        log_sender
+            .log(
+                Level::INFO,
+                format!("Percentage done: group {group}: {local_progress}"),
+            )
+            .await?;
         stats.add(sync_stats);
     }
 
@@ -825,25 +885,42 @@ pub(crate) async fn push_group(
                 continue;
             }
             if snapshot.protected {
-                info!(
-                    "Kept protected snapshot {name} on remote",
-                    name = snapshot.backup
-                );
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!(
+                            "Kept protected snapshot {name} on remote",
+                            name = snapshot.backup
+                        ),
+                    )
+                    .await?;
                 continue;
             }
             match forget_target_snapshot(&params, &target_namespace, &snapshot.backup).await {
                 Ok(()) => {
-                    info!(
-                        "Removed vanished snapshot {name} from remote",
-                        name = snapshot.backup
-                    );
+                    log_sender
+                        .log(
+                            Level::INFO,
+                            format!(
+                                "Removed vanished snapshot {name} from remote",
+                                name = snapshot.backup
+                            ),
+                        )
+                        .await?;
                 }
                 Err(err) => {
-                    warn!("Encountered errors: {err:#}");
-                    warn!(
-                        "Failed to remove vanished snapshot {name} from remote!",
-                        name = snapshot.backup
-                    );
+                    log_sender
+                        .log(Level::WARN, format!("Encountered errors: {err:#}"))
+                        .await?;
+                    log_sender
+                        .log(
+                            Level::WARN,
+                            format!(
+                                "Failed to remove vanished snapshot {name} from remote!",
+                                name = snapshot.backup
+                            ),
+                        )
+                        .await?;
                 }
             }
             stats.add(SyncStats::from(RemovedVanishedStats {
@@ -868,24 +945,40 @@ pub(crate) async fn push_snapshot(
     namespace: &BackupNamespace,
     snapshot: &BackupDir,
     fetch_previous_manifest: bool,
+    log_sender: Arc<LogLineSender>,
+    prefix: &String,
 ) -> Result<SyncStats, Error> {
     let mut stats = SyncStats::default();
-    let target_ns = params.map_to_target(namespace)?;
+    let target_ns = params
+        .map_to_target(namespace)
+        .with_context(|| prefix.clone())?;
     let backup_dir = params
         .source
         .store
-        .backup_dir(namespace.clone(), snapshot.clone())?;
+        .backup_dir(namespace.clone(), snapshot.clone())
+        .with_context(|| prefix.clone())?;
 
     // Reader locks the snapshot
-    let reader = params.source.reader(namespace, snapshot).await?;
+    let reader = params
+        .source
+        .reader(namespace, snapshot)
+        .await
+        .with_context(|| prefix.clone())?;
 
     // Does not lock the manifest, but the reader already assures a locked snapshot
     let source_manifest = match backup_dir.load_manifest() {
         Ok((manifest, _raw_size)) => manifest,
         Err(err) => {
             // No manifest in snapshot or failed to read, warn and skip
-            log::warn!("Encountered errors: {err:#}");
-            log::warn!("Failed to load manifest for '{snapshot}'!");
+            log_sender
+                .log(
+                    Level::WARN,
+                    format!("{prefix}: Encountered errors: {err:#}"),
+                )
+                .await?;
+            log_sender
+                .log(Level::WARN, format!("{prefix}: Failed to load manifest!"))
+                .await?;
             return Ok(stats);
         }
     };
@@ -912,14 +1005,22 @@ pub(crate) async fn push_snapshot(
             no_cache: false,
         },
     )
-    .await?;
+    .await
+    .with_context(|| prefix.clone())?;
 
     let mut previous_manifest = None;
     // Use manifest of previous snapshots in group on target for chunk upload deduplication
     if fetch_previous_manifest {
         match backup_writer.download_previous_manifest().await {
             Ok(manifest) => previous_manifest = Some(Arc::new(manifest)),
-            Err(err) => log::info!("Could not download previous manifest - {err}"),
+            Err(err) => {
+                log_sender
+                    .log(
+                        Level::INFO,
+                        format!("{prefix}: Could not download previous manifest - {err}"),
+                    )
+                    .await?
+            }
         }
     };
 
@@ -948,12 +1049,32 @@ pub(crate) async fn push_snapshot(
         path.push(&entry.filename);
         if path.try_exists()? {
             let archive_name = BackupArchiveName::from_path(&entry.filename)?;
+            log_sender
+                .log(
+                    Level::INFO,
+                    format!("{prefix}: sync archive {archive_name}"),
+                )
+                .await?;
+            let archive_prefix = format!("{prefix}: {archive_name}");
             match archive_name.archive_type() {
                 ArchiveType::Blob => {
                     let file = std::fs::File::open(&path)?;
                     let backup_stats = backup_writer
                         .upload_blob(file, archive_name.as_ref())
                         .await?;
+                    log_sender
+                        .log(
+                            Level::INFO,
+                            format!(
+                                "{archive_prefix}: uploaded {} ({}/s)",
+                                HumanByte::from(backup_stats.size),
+                                HumanByte::new_binary(
+                                    backup_stats.size as f64 / backup_stats.duration.as_secs_f64()
+                                ),
+                            ),
+                        )
+                        .await
+                        .with_context(|| archive_prefix.clone())?;
                     stats.add(SyncStats {
                         chunk_count: backup_stats.chunk_count as usize,
                         bytes: backup_stats.size as usize,
@@ -972,7 +1093,7 @@ pub(crate) async fn push_snapshot(
                             )
                             .await;
                     }
-                    let index = DynamicIndexReader::open(&path)?;
+                    let index = DynamicIndexReader::open(&path).with_context(|| prefix.clone())?;
                     let chunk_reader = reader
                         .chunk_reader(entry.chunk_crypt_mode())
                         .context("failed to get chunk reader")?;
@@ -984,7 +1105,20 @@ pub(crate) async fn push_snapshot(
                         IndexType::Dynamic,
                         known_chunks.clone(),
                     )
-                    .await?;
+                    .await
+                    .with_context(|| archive_prefix.clone())?;
+                    log_sender
+                        .log(
+                            Level::INFO,
+                            format!(
+                                "{archive_prefix}: uploaded {} ({}/s)",
+                                HumanByte::from(sync_stats.bytes),
+                                HumanByte::new_binary(
+                                    sync_stats.bytes as f64 / sync_stats.elapsed.as_secs_f64()
+                                ),
+                            ),
+                        )
+                        .await?;
                     stats.add(sync_stats);
                 }
                 ArchiveType::FixedIndex => {
@@ -1001,7 +1135,8 @@ pub(crate) async fn push_snapshot(
                     let index = FixedIndexReader::open(&path)?;
                     let chunk_reader = reader
                         .chunk_reader(entry.chunk_crypt_mode())
-                        .context("failed to get chunk reader")?;
+                        .context("failed to get chunk reader")
+                        .with_context(|| archive_prefix.clone())?;
                     let size = index.index_bytes();
                     let sync_stats = push_index(
                         &archive_name,
@@ -1011,7 +1146,20 @@ pub(crate) async fn push_snapshot(
                         IndexType::Fixed(Some(size)),
                         known_chunks.clone(),
                     )
-                    .await?;
+                    .await
+                    .with_context(|| archive_prefix.clone())?;
+                    log_sender
+                        .log(
+                            Level::INFO,
+                            format!(
+                                "{archive_prefix}: uploaded {} ({}/s)",
+                                HumanByte::from(sync_stats.bytes),
+                                HumanByte::new_binary(
+                                    sync_stats.bytes as f64 / sync_stats.elapsed.as_secs_f64()
+                                ),
+                            ),
+                        )
+                        .await?;
                     stats.add(sync_stats);
                 }
             }
@@ -1032,7 +1180,8 @@ pub(crate) async fn push_snapshot(
                 client_log_name.as_ref(),
                 upload_options.clone(),
             )
-            .await?;
+            .await
+            .with_context(|| prefix.clone())?;
     }
 
     // Rewrite manifest for pushed snapshot, recreating manifest from source on target
@@ -1044,8 +1193,12 @@ pub(crate) async fn push_snapshot(
             MANIFEST_BLOB_NAME.as_ref(),
             upload_options,
         )
-        .await?;
-    backup_writer.finish().await?;
+        .await
+        .with_context(|| prefix.clone())?;
+    backup_writer
+        .finish()
+        .await
+        .with_context(|| prefix.clone())?;
 
     stats.add(SyncStats {
         chunk_count: backup_stats.chunk_count as usize,
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH proxmox-backup v6 15/15] ui: expose group worker setting in sync job edit window
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (13 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 14/15] server: push: prefix log messages and add additional logging Christian Ebner
@ 2026-04-17  9:26 ` Christian Ebner
  2026-04-20 12:33 ` [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Fabian Grünbichler
  2026-04-21 10:28 ` superseded: " Christian Ebner
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-17  9:26 UTC (permalink / raw)
  To: pbs-devel

Allow configuring the number of parallel group workers via the web
interface.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 5:
- no changes

 www/window/SyncJobEdit.js | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/www/window/SyncJobEdit.js b/www/window/SyncJobEdit.js
index 074c7855a..26c82bc71 100644
--- a/www/window/SyncJobEdit.js
+++ b/www/window/SyncJobEdit.js
@@ -448,6 +448,17 @@ Ext.define('PBS.window.SyncJobEdit', {
                             deleteEmpty: '{!isCreate}',
                         },
                     },
+                    {
+                        xtype: 'proxmoxintegerfield',
+                        name: 'worker-threads',
+                        fieldLabel: gettext('# of Group Workers'),
+                        emptyText: '1',
+                        minValue: 1,
+                        maxValue: 32,
+                        cbind: {
+                            deleteEmpty: '{!isCreate}',
+                        },
+                    },
                     {
                         xtype: 'proxmoxcheckbox',
                         fieldLabel: gettext('Re-sync Corrupt'),
-- 
2.47.3





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages Christian Ebner
@ 2026-04-20 10:57   ` Fabian Grünbichler
  2026-04-20 17:15     ` Christian Ebner
  0 siblings, 1 reply; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-20 10:57 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 17, 2026 11:26 am, Christian Ebner wrote:
> Implements a buffered logger instance which collects messages send
> from different sender instances via an async tokio channel and
> buffers them. Sender identify by label and provide a log level for
> each log line to be buffered and flushed.
> 
> On collection, log lines are grouped by label and buffered in
> sequence of arrival per label, up to the configured maximum number of
> per group lines or periodically with the configured interval. The
> interval timeout is reset when contents are flushed. In addition,
> senders can request flushing at any given point.
> 
> When the timeout set based on the interval is reached, all labels
> log buffers are flushed. There is no guarantee on the order of labels
> when flushing.
> 
> Log output is written based on provided log line level and prefixed
> by the label.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 5:
> - not present in previous version
> 
>  pbs-tools/Cargo.toml             |   2 +
>  pbs-tools/src/buffered_logger.rs | 216 +++++++++++++++++++++++++++++++
>  pbs-tools/src/lib.rs             |   1 +
>  3 files changed, 219 insertions(+)
>  create mode 100644 pbs-tools/src/buffered_logger.rs
> 
> diff --git a/pbs-tools/Cargo.toml b/pbs-tools/Cargo.toml
> index 998e3077e..6b1d92fa6 100644
> --- a/pbs-tools/Cargo.toml
> +++ b/pbs-tools/Cargo.toml
> @@ -17,10 +17,12 @@ openssl.workspace = true
>  serde_json.workspace = true
>  # rt-multi-thread is required for block_in_place
>  tokio = { workspace = true, features = [ "fs", "io-util", "rt", "rt-multi-thread", "sync" ] }
> +tracing.workspace = true
>  
>  proxmox-async.workspace = true
>  proxmox-io = { workspace = true, features = [ "tokio" ] }
>  proxmox-human-byte.workspace = true
> +proxmox-log.workspace = true
>  proxmox-sys.workspace = true
>  proxmox-time.workspace = true
>  
> diff --git a/pbs-tools/src/buffered_logger.rs b/pbs-tools/src/buffered_logger.rs
> new file mode 100644
> index 000000000..39cf068cd
> --- /dev/null
> +++ b/pbs-tools/src/buffered_logger.rs
> @@ -0,0 +1,216 @@
> +//! Log aggregator to collect and group messages send from concurrent tasks via

nit: sen*t*

> +//! a tokio channel.
> +
> +use std::collections::hash_map::Entry;
> +use std::collections::HashMap;
> +use std::time::Duration;
> +
> +use anyhow::Error;
> +use tokio::sync::mpsc;
> +use tokio::time::{self, Instant};
> +use tracing::{debug, error, info, trace, warn, Level};
> +
> +use proxmox_log::LogContext;
> +
> +/// Label to be used to group currently buffered messages when flushing.

I'd drop the currently here

> +pub type SenderLabel = String;
> +
> +/// Requested action for the log collection task
> +enum SenderRequest {
> +    // new log line to be buffered
> +    Message(LogLine),

see below, I think this should have the label split out

> +    // flush currently buffered log lines associated by sender label
> +    Flush(SenderLabel),

this is actually used at the moment to finish a particular label, maybe
that should be made explicit (see below)

> +}
> +
> +/// Logger instance to buffer and group log output to keep concurrent logs readable
> +///
> +/// Receives the logs from an async input channel, buffers them grouped by input
> +/// channel and flushes them after either reaching a timeout or capacity limit.

/// Receives log lines tagged with a label, and buffers them grouped
/// by the label value. Buffered messages are flushed either after
/// reaching a certain timeout or capacity limit, or when explicitly
/// requested.

> +pub struct BufferedLogger {
> +    // buffer to aggregate log lines based on sender label
> +    buffer_map: HashMap<SenderLabel, Vec<LogLine>>,

do we expect this to be always used with the tiny limits we currently
employ? if so, we might want to consider a different structure here that
is more efficient/optimized for that use case?

also note that this effectively duplicates the label once per line..
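A minimal sketch of the alternative this comment hints at (my own illustration, not code from this series): with only a handful of labels, a linear-scan `Vec` avoids hashing and stores each label string exactly once, instead of once per buffered line.

```rust
// Linear-scan buffer map for a small, bounded number of labels.
// Each label string is stored once, alongside its buffered lines.
struct SmallBufferMap {
    entries: Vec<(String, Vec<String>)>,
}

impl SmallBufferMap {
    fn new() -> Self {
        Self {
            entries: Vec::new(),
        }
    }

    fn push(&mut self, label: &str, line: String) {
        // linear scan is cheap for the tiny label counts expected here
        match self.entries.iter_mut().find(|(l, _)| l == label) {
            Some((_, lines)) => lines.push(line),
            None => self.entries.push((label.to_string(), vec![line])),
        }
    }
}
```

Whether this actually beats a `HashMap` would need measuring; the point is only that the structure matches the access pattern.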

> +    // maximum number of received lines for an individual sender instance before
> +    // flushing
> +    max_buffered_lines: usize,
> +    // maximum aggregation duration of received lines for an individual sender
> +    // instance before flushing
> +    max_aggregation_time: Duration,
> +    // channel to receive log messages
> +    receiver: mpsc::Receiver<SenderRequest>,
> +}
> +
> +/// Instance to create new sender instances by cloning the channel sender
> +pub struct LogLineSenderBuilder {
> +    // to clone new senders if requested
> +    _sender: mpsc::Sender<SenderRequest>,

nit: this should be called `sender`, it is used below even if just for
cloning?

> +}
> +
> +impl LogLineSenderBuilder {
> +    /// Create new sender instance to send log messages, to be grouped by given label
> +    ///
> +    /// Label is not checked to be unique (no other instance with same label exists),
> +    /// it is the callers responsibility to check so if required.
> +    pub fn sender_with_label(&self, label: SenderLabel) -> LogLineSender {
> +        LogLineSender {
> +            label,
> +            sender: self._sender.clone(),
> +        }
> +    }
> +}
> +
> +/// Sender to publish new log messages to buffered log aggregator

this sender doesn't publish anything

> +pub struct LogLineSender {
> +    // label used to group log lines
> +    label: SenderLabel,
> +    // sender to publish new log lines to buffered log aggregator task
> +    sender: mpsc::Sender<SenderRequest>,
> +}
> +
> +impl LogLineSender {
> +    /// Send a new log message with given level to the buffered logger task
> +    pub async fn log(&self, level: Level, message: String) -> Result<(), Error> {
> +        let line = LogLine {
> +            label: self.label.clone(),
> +            level,
> +            message,
> +        };
> +        self.sender.send(SenderRequest::Message(line)).await?;
> +        Ok(())
> +    }
> +
> +    /// Flush all messages with sender's label

Flush all *buffered* messages with *this* sender's label

?

> +    pub async fn flush(&self) -> Result<(), Error> {
> +        self.sender
> +            .send(SenderRequest::Flush(self.label.clone()))
> +            .await?;
> +        Ok(())
> +    }
> +}
> +
> +/// Log message entity
> +struct LogLine {
> +    /// label indentifiying the sender

nit: typo, capitalization inconsistent

> +    label: SenderLabel,
> +    /// Log level to use during flushing

Log level of the message

?

> +    level: Level,
> +    /// log line to be buffered and flushed

Log message

?

buffering and flushing happens elsewhere..

> +    message: String,
> +}
> +
> +impl BufferedLogger {
> +    /// New instance of a buffered logger
> +    pub fn new(
> +        max_buffered_lines: usize,
> +        max_aggregation_time: Duration,
> +    ) -> (Self, LogLineSenderBuilder) {
> +        let (_sender, receiver) = mpsc::channel(100);

nit: this should be called `sender`

> +
> +        (
> +            Self {
> +                buffer_map: HashMap::new(),
> +                max_buffered_lines,
> +                max_aggregation_time,
> +                receiver,
> +            },
> +            LogLineSenderBuilder { _sender },
> +        )
> +    }
> +
> +    /// Starts the collection loop spawned on a new tokio task
> +    /// Finishes when all sender belonging to the channel have been dropped.
> +    pub fn run_log_collection(mut self) {
> +        let future = async move {
> +            loop {
> +                let deadline = Instant::now() + self.max_aggregation_time;
> +                match time::timeout_at(deadline, self.receive_log_line()).await {

why manually calculate the deadline, wouldn't using `time::timeout` work
as well? the only difference from a quick glance is that that one does a
checked_add for now + delay..

but also, isn't this kind of broken in any case? let's say I have two
labels A and B:

 0.99 A1
 1.98 A2
 2.97 A3
 3.96 A4
 4.95 A5 (now A is at capacity)
 5.94 B1
 9.90 B5 (now B is at capacity as well)

either

10.90 timeout elapses, everything is flushed

or

10.89 A6 (A gets flushed and can start over - but B hasn't been flushed)
11.88 A7
12.87 A8
13.86 A9
14.85 A10 (A has 5 buffered messages again)
..

this means that any label that doesn't log a 6th message can stall for
quite a long time, as long as other labels make progress (and it isn't
flushed explicitly)?
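One way to avoid that stall (a sketch under my own assumptions, not code from this series) is to track per label when its oldest buffered line arrived, and have the collection loop sleep until the *earliest* per-label deadline instead of resetting one global timeout on every flush:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Per-label deadlines: each label remembers when its oldest buffered line
// arrived. A label that goes quiet is still flushed once that line ages out,
// regardless of how much progress other labels make.
struct LabelBuffers {
    max_age: Duration,
    map: HashMap<String, (Instant, Vec<String>)>,
}

impl LabelBuffers {
    fn push(&mut self, label: &str, line: String, now: Instant) {
        // keep the arrival time of the oldest line; later pushes don't reset it
        self.map
            .entry(label.to_string())
            .or_insert_with(|| (now, Vec::new()))
            .1
            .push(line);
    }

    // earliest wakeup over all labels; None when nothing is buffered
    fn next_deadline(&self) -> Option<Instant> {
        self.map
            .values()
            .map(|(oldest, _)| *oldest + self.max_age)
            .min()
    }
}
```

The loop would then pass `next_deadline()` to something like `tokio::time::timeout_at` and flush only the labels whose deadline elapsed.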

> +                    Ok(finished) => {
> +                        if finished {
> +                            break;
> +                        }
> +                    }
> +                    Err(_timeout) => self.flush_all_buffered(),
> +                }
> +            }
> +        };
> +        match LogContext::current() {
> +            None => tokio::spawn(future),
> +            Some(context) => tokio::spawn(context.scope(future)),
> +        };
> +    }
> +
> +    /// Collects new log lines, buffers and flushes them if max lines limit exceeded.
> +    ///
> +    /// Returns `true` if all the senders have been dropped and the task should no
> +    /// longer wait for new messages and finish.
> +    async fn receive_log_line(&mut self) -> bool {
> +        if let Some(request) = self.receiver.recv().await {
> +            match request {
> +                SenderRequest::Flush(label) => {
> +                    if let Some(log_lines) = self.buffer_map.get_mut(&label) {
> +                        Self::log_with_label(&label, log_lines);
> +                        log_lines.clear();
> +                    }
> +                }
> +                SenderRequest::Message(log_line) => {

if this would be Message((label, level, line)) or Message((label,
level_and_line)) the label would not need to be stored in the buffer
keys and values..
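A hedged sketch of that split (names are mine, not from the series): the label travels once per request, so `LogLine` no longer needs to carry its own copy:

```rust
// Stand-in for tracing::Level to keep the sketch dependency-free.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Info,
    Warn,
}

// Payload without the label duplicated inside it.
struct LogLine {
    level: Level,
    message: String,
}

// The label is split out of the payload, so the collection task can key
// its buffer map on it without storing a second copy per line.
enum SenderRequest {
    Message(String, LogLine),
    Flush(String),
}

fn label_of(req: &SenderRequest) -> &str {
    match req {
        SenderRequest::Message(label, _) => label,
        SenderRequest::Flush(label) => label,
    }
}
```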

> +                    if self.max_buffered_lines == 0
> +                        || self.max_aggregation_time < Duration::from_secs(0)

the timeout can never be below zero, as that is the minimum duration
(duration is unsigned)?

> +                    {
> +                        // shortcut if no buffering should happen
> +                        Self::log_by_level(&log_line.label, &log_line);

shouldn't we rather handle this by not using the buffered logger in the
first place? e.g., have this and a simple not-buffering logger implement
a shared logging trait, or something similar?

one simple approach would be to just make the LogLineSender log directly
in this case, and not send anything at all?

because if we don't want buffering, sending all log messages through a
channel and setting up the timeout machinery can be avoided completely..
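The suggested short-circuit could look roughly like this (a std-only sketch of the idea, using `std::sync::mpsc` in place of the tokio channel and a `Vec` in place of `info!`):

```rust
use std::sync::mpsc::Sender;

// The sender either forwards into the channel (buffered mode) or writes
// directly (unbuffered mode); no channel round-trip and no timeout
// machinery when buffering is disabled.
enum LogSink {
    Buffered(Sender<String>),
    Direct,
}

// `out` stands in for the tracing macros in this sketch.
fn log(sink: &LogSink, label: &str, message: &str, out: &mut Vec<String>) {
    match sink {
        LogSink::Buffered(tx) => {
            // buffered path: ship the line to the collection task
            let _ = tx.send(format!("{label}\t{message}"));
        }
        LogSink::Direct => {
            // unbuffered path: write immediately, prefixed by the label
            out.push(format!("[{label}]: {message}"));
        }
    }
}
```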

> +                    }
> +
> +                    match self.buffer_map.entry(log_line.label.clone()) {
> +                        Entry::Occupied(mut occupied) => {
> +                            let log_lines = occupied.get_mut();
> +                            if log_lines.len() + 1 > self.max_buffered_lines {
> +                                // reached limit for this label,
> +                                // flush all buffered and new log line
> +                                Self::log_with_label(&log_line.label, log_lines);
> +                                log_lines.clear();
> +                                Self::log_by_level(&log_line.label, &log_line);
> +                            } else {
> +                                // below limit, push to buffer to flush later
> +                                log_lines.push(log_line);
> +                            }
> +                        }
> +                        Entry::Vacant(vacant) => {
> +                            vacant.insert(vec![log_line]);
> +                        }
> +                    }
> +                }
> +            }
> +            return false;
> +        }
> +
> +        // no more senders, all LogLineSender's and LogLineSenderBuilder have been dropped

nit: typo `'s`

> +        self.flush_all_buffered();
> +        true
> +    }
> +
> +    /// Flush all currently buffered contents without ordering, but grouped by label
> +    fn flush_all_buffered(&mut self) {
> +        for (label, log_lines) in self.buffer_map.iter() {
> +            Self::log_with_label(label, log_lines);
> +        }
> +        self.buffer_map.clear();

wouldn't it be better performance-wise to
- clear each label's log lines (like in SenderRequest::Flush)
- remove the hashmap entry in SenderRequest::Flush, or rename that one to
  finish, since that is what it actually does?

granted, this only triggers when the timeout elapses or there are no
more senders, but for the timeout case it might still be beneficial? it
should remove a lot of allocation churn at least..
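A sketch of that split (assumption, not the patch's code): flush drains each label's `Vec` in place, keeping its capacity for the next burst, while finish removes the entry entirely:

```rust
use std::collections::HashMap;

// Periodic flush: write out and *clear* each label's buffer. The Vec's
// allocation is retained, so the next burst of lines reuses it.
fn flush_all(map: &mut HashMap<String, Vec<String>>, out: &mut Vec<String>) {
    for (label, lines) in map.iter_mut() {
        for line in lines.drain(..) {
            out.push(format!("[{label}]: {line}"));
        }
    }
}

// Explicit finish: the label is done, drop its entry from the map.
fn finish(map: &mut HashMap<String, Vec<String>>, label: &str, out: &mut Vec<String>) {
    if let Some(lines) = map.remove(label) {
        for line in lines {
            out.push(format!("[{label}]: {line}"));
        }
    }
}
```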

> +    }
> +
> +    /// Log given log lines prefixed by label
> +    fn log_with_label(label: &str, log_lines: &[LogLine]) {

currently each LogLine contains the label anyway, but see above, I do
think this split makes sense but it should be done completely ;)

> +        for log_line in log_lines {
> +            Self::log_by_level(label, log_line);
> +        }
> +    }
> +
> +    /// Write the given log line prefixed by label
> +    fn log_by_level(label: &str, log_line: &LogLine) {

this also logs with label, IMHO the naming is confusing..

> +        match log_line.level {
> +            Level::ERROR => error!("[{label}]: {}", log_line.message),
> +            Level::WARN => warn!("[{label}]: {}", log_line.message),
> +            Level::INFO => info!("[{label}]: {}", log_line.message),
> +            Level::DEBUG => debug!("[{label}]: {}", log_line.message),
> +            Level::TRACE => trace!("[{label}]: {}", log_line.message),
> +        }
> +    }
> +}
> diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
> index f41aef6df..1e3972c92 100644
> --- a/pbs-tools/src/lib.rs
> +++ b/pbs-tools/src/lib.rs
> @@ -1,4 +1,5 @@
>  pub mod async_lru_cache;
> +pub mod buffered_logger;
>  pub mod cert;
>  pub mod crypt_config;
>  pub mod format;
> -- 
> 2.47.3
> 
> 
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit Christian Ebner
@ 2026-04-20 11:15   ` Fabian Grünbichler
  0 siblings, 0 replies; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-20 11:15 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 17, 2026 11:26 am, Christian Ebner wrote:
> The BoundedJoinSet allows to run tasks concurrently via a JoinSet,
> but constrains the number of concurrent tasks to be run at once by an
> upper limit.

constrains the number of concurrent tasks by an upper limit.

"concurrent tasks" already states that they are at the same time ;)

> 
> In contrast to the ParallelHandler implementation, which is purely
> sync implementation and does not provide easy handling for returned
> results, rhis allows to execute tasks in an async context with straight

which is *a* purely

s/rhis/this

> forward handling of results, as required for e.g. pulling/pushing of
> backup groups in parallel for sync jobs. Also, log context is easily
> preserved, which is of importance for task logging.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 5:
> - not present in previous version, refactored logic from previous
>   GroupWorker implementation.
> 
>  pbs-tools/src/bounded_join_set.rs | 69 +++++++++++++++++++++++++++++++
>  pbs-tools/src/lib.rs              |  1 +
>  2 files changed, 70 insertions(+)
>  create mode 100644 pbs-tools/src/bounded_join_set.rs
> 
> diff --git a/pbs-tools/src/bounded_join_set.rs b/pbs-tools/src/bounded_join_set.rs
> new file mode 100644
> index 000000000..01b27b2a6
> --- /dev/null
> +++ b/pbs-tools/src/bounded_join_set.rs
> @@ -0,0 +1,69 @@
> +//! JoinSet with an upper bound of concurrent tasks.
> +//!
> +//! Allows to run up to the configured number of tasks concurrently in an async
> +//! context.
> +
> +use std::future::Future;
> +
> +use tokio::task::{JoinError, JoinSet};
> +
> +use proxmox_log::LogContext;
> +
> +/// Run up to preconfigured number of futures concurrently on tokio tasks.
> +pub struct BoundedJoinSet<T> {
> +    // upper bound for concurrent task execution
> +    max_tasks: usize,
> +    // handles to currently active tasks
> +    workers: JoinSet<T>,

the tasks might also no longer be active - what they have in common is
that they've been spawned ;)

> +}
> +
> +impl<T: Send + 'static> BoundedJoinSet<T> {
> +    /// Create a new join set with up to `max_task` concurrently executed tasks.
> +    pub fn new(max_tasks: usize) -> Self {
> +        Self {
> +            max_tasks,
> +            workers: JoinSet::new(),
> +        }
> +    }
> +
> +    /// Spawn the given task on the workers, waiting until there is capacity to do so.
> +    ///
> +    /// If there is no capacity, this will await until there is so, returning the results
> +    /// for the finished task(s) providing the now free running slot in order of completion
> +    /// or a `JoinError` if joining failed.
> +    pub async fn spawn_task<F>(&mut self, task: F) -> Result<Vec<T>, JoinError>
> +    where
> +        F: Future<Output = T>,
> +        F: Send + 'static,
> +    {
> +        let mut results = Vec::with_capacity(self.workers.len());
> +
> +        while self.workers.len() >= self.max_tasks {
> +            // capacity reached, wait for an active task to complete
> +            if let Some(result) = self.workers.join_next().await {
> +                results.push(result?);
> +            }
> +        }

by virtue of its design, there can only ever be a single result returned
here (because the join set can only be at capacity, not over), right?

should we assert this here and encode it in the return value? or did you
actually intend for this to return all completed tasks? in that case, we
need a loop with try_join_next in addition to the blocking call.. that
might actually be beneficial, to log results early..

similar question - do we want to be able to retrieve individual results
as they become available, without the need to spawn new tasks? e.g.,
have a `join_next` that the sync jobs can call in a loop once they've
spawned all the groups they want to spawn? that would make `join_active`
below not used at the moment, though it might still be helpful for some
future use case?
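The drain-then-block pattern described above, as a std-only analog (my own sketch; tokio's `try_join_next`/`join_next` replaced by non-blocking and blocking receives on a channel of finished results):

```rust
use std::sync::mpsc::Receiver;

// Collect every result that is already available without blocking; only if
// nothing has finished yet, block for exactly one completion. Mirrors a
// try_join_next loop followed by a single join_next on a JoinSet.
fn drain_then_block(rx: &Receiver<u32>) -> Vec<u32> {
    // non-blocking drain of already-finished tasks (try_join_next analog)
    let mut results: Vec<u32> = rx.try_iter().collect();
    if results.is_empty() {
        // blocking wait for the next completion (join_next analog)
        if let Ok(v) = rx.recv() {
            results.push(v);
        }
    }
    results
}
```

This way completed results can be surfaced (and logged) as early as possible before spawning the next task.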

> +
> +        match LogContext::current() {
> +            Some(context) => self.workers.spawn(context.scope(task)),
> +            None => self.workers.spawn(task),
> +        };
> +
> +        Ok(results)
> +    }
> +
> +    /// Wait on all active tasks to run to completion.
> +    ///
> +    /// Returns the results for each task in order of completion or a `JoinError`
> +    /// if joining failed.
> +    pub async fn join_active(&mut self) -> Result<Vec<T>, JoinError> {

the active here is a misnomer as well.. for a join_set this is called
join_all (modulo the panic behaviour, which we do not want here, so
maybe we do need a different name..)

> +        let mut results = Vec::with_capacity(self.workers.len());
> +
> +        while let Some(result) = self.workers.join_next().await {
> +            results.push(result?);
> +        }
> +
> +        Ok(results)
> +    }
> +}
> diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
> index 1e3972c92..dc55366b6 100644
> --- a/pbs-tools/src/lib.rs
> +++ b/pbs-tools/src/lib.rs
> @@ -1,4 +1,5 @@
>  pub mod async_lru_cache;
> +pub mod bounded_join_set;
>  pub mod buffered_logger;
>  pub mod cert;
>  pub mod crypt_config;
> -- 
> 2.47.3
> 
> 
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context Christian Ebner
@ 2026-04-20 11:56   ` Fabian Grünbichler
  2026-04-21  7:21     ` Christian Ebner
  2026-04-21 12:57     ` Thomas Lamprecht
  0 siblings, 2 replies; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-20 11:56 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 17, 2026 11:26 am, Christian Ebner wrote:
> Pulling groups and therefore also snapshots in parallel leads to
> unordered log outputs, making it mostly impossible to relate a log
> message to a backup snapshot/group.
> 
> Therefore, prefix pull job log messages by the corresponding group or
> snapshot and set the error context accordingly.
> 
> Also, reword some messages, inline variables in format strings and
> start log lines with capital letters to get consistent output.
> 
> By using the buffered logger implementation and buffer up to 5 lines
> with a timeout of 1 second, subsequent log lines arriving in fast
> succession are kept together, reducing the mixing of lines.
> 
> Example output for a sequential pull job:
> ```
> ...
> [ct/100]: 2025-11-17T10:11:42Z: start sync
> [ct/100]: 2025-11-17T10:11:42Z: pct.conf.blob: sync archive
> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: sync archive
> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: downloaded 16.785 MiB (280.791 MiB/s)
> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: sync archive
> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: downloaded 65.703 KiB (29.1 MiB/s)
> [ct/100]: 2025-11-17T10:11:42Z: sync done
> [ct/100]: percentage done: 9.09% (1/11 groups)
> [ct/101]: 2026-03-31T12:20:16Z: start sync
> [ct/101]: 2026-03-31T12:20:16Z: pct.conf.blob: sync archive
> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: sync archive
> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: downloaded 199.806 MiB (311.91 MiB/s)
> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: sync archive
> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: downloaded 180.379 KiB (22.748 MiB/s)
> [ct/101]: 2026-03-31T12:20:16Z: sync done

this is probably the wrong patch for this comment, but since you have
the sample output here ;)

should we pad the labels? the buffered logger knows all currently active
labels and could adapt it? otherwise especially for host backups the
logs are again not really scannable by humans, because there will be
weird jumps in alignment.. or (.. continued below)

> ...
> ```
> 
> Example output for a parallel pull job:
> ```
> ...
> [ct/107]: 2025-07-16T09:14:01Z: start sync
> [ct/107]: 2025-07-16T09:14:01Z: pct.conf.blob: sync archive
> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: sync archive
> [vm/108]: 2025-09-19T07:37:19Z: start sync
> [vm/108]: 2025-09-19T07:37:19Z: qemu-server.conf.blob: sync archive
> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: sync archive
> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: downloaded 609.233 MiB (112.628 MiB/s)
> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: sync archive
> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: downloaded 1.172 MiB (17.838 MiB/s)
> [ct/107]: 2025-07-16T09:14:01Z: sync done
> [ct/107]: percentage done: 72.73% (8/11 groups)

the way the prefix and snapshot are formatted could also be interpreted
at first glance as a timestamp of the log line.. why not just prepend
the prefix on the logger side, and leave it up to the caller to do the
formatting? then we could use "{type}/{id}/" as prefix here? or add
'snapshot' in those lines to make it clear? granted, this is more of an
issue when viewing the log via `proxmox-backup-manager`, as in the UI we
have the log timestamps up front..

and maybe(?) log the progress line using a different prefix? because
right now the information that the group [ct/107] is finished is not
really clear from the output, IMHO.

the progress logging is also still broken (this is for a sync that takes
a while, this is not log messages being buffered and re-ordered!):

$ proxmox-backup-manager task log 'UPID:yuna:00070656:001E420F:00000002:69E610EB:syncjob:local\x3atest\x3atank\x3a\x3as\x2dbc01cba6\x2d805a:root@pam:' | grep -e namespace -e 'percentage done'
Syncing datastore 'test', root namespace into datastore 'tank', root namespace
Finished syncing root namespace, current progress: 0 groups, 0 snapshots
Syncing datastore 'test', namespace 'test' into datastore 'tank', namespace 'test'
[host/exclusion-test]: percentage done: 5.26% (1/19 groups)
[host/acltest]: percentage done: 5.26% (1/19 groups)
[host/logtest]: percentage done: 5.26% (1/19 groups)
[host/onemeg]: percentage done: 5.26% (1/19 groups)
[host/fourmeg]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
[host/symlink]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
[host/symlink]: percentage done: 5.26% (1/19 groups)
[host/fourmeg]: percentage done: 5.26% (1/19 groups)
[host/format-v2-test]: percentage done: 1.75% (0/19 groups, 1/3 snapshots in group #1)
[host/format-v2-test]: percentage done: 3.51% (0/19 groups, 2/3 snapshots in group #1)
[host/format-v2-test]: percentage done: 5.26% (1/19 groups)
[host/incrementaltest2]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
[host/incrementaltest2]: percentage done: 5.26% (1/19 groups)

> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: downloaded 1.196 GiB (156.892 MiB/s)
> [vm/108]: 2025-09-19T07:37:19Z: sync done
> ...
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 5:
> - uses BufferedLogger implementation, refactored accordingly
> - improve log line prefixes
> - add missing error contexts
> 
>  src/server/pull.rs | 314 +++++++++++++++++++++++++++++++++------------
>  src/server/sync.rs |   8 +-
>  2 files changed, 237 insertions(+), 85 deletions(-)
> 
> diff --git a/src/server/pull.rs b/src/server/pull.rs
> index 611441d2a..f7aae4d59 100644
> --- a/src/server/pull.rs
> +++ b/src/server/pull.rs
> @@ -5,11 +5,11 @@ use std::collections::{HashMap, HashSet};
>  use std::io::Seek;
>  use std::sync::atomic::{AtomicUsize, Ordering};
>  use std::sync::{Arc, Mutex};
> -use std::time::SystemTime;
> +use std::time::{Duration, SystemTime};
>  
>  use anyhow::{bail, format_err, Context, Error};
>  use proxmox_human_byte::HumanByte;
> -use tracing::{info, warn};
> +use tracing::{info, Level};
>  
>  use pbs_api_types::{
>      print_store_and_ns, ArchiveType, Authid, BackupArchiveName, BackupDir, BackupGroup,
> @@ -27,6 +27,7 @@ use pbs_datastore::manifest::{BackupManifest, FileInfo};
>  use pbs_datastore::read_chunk::AsyncReadChunk;
>  use pbs_datastore::{check_backup_owner, DataStore, DatastoreBackend, StoreProgress};
>  use pbs_tools::bounded_join_set::BoundedJoinSet;
> +use pbs_tools::buffered_logger::{BufferedLogger, LogLineSender};
>  use pbs_tools::sha::sha256;
>  
>  use super::sync::{
> @@ -153,6 +154,8 @@ async fn pull_index_chunks<I: IndexFile>(
>      index: I,
>      encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>      backend: &DatastoreBackend,
> +    archive_prefix: &str,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
>      use futures::stream::{self, StreamExt, TryStreamExt};
>  
> @@ -247,11 +250,16 @@ async fn pull_index_chunks<I: IndexFile>(
>      let bytes = bytes.load(Ordering::SeqCst);
>      let chunk_count = chunk_count.load(Ordering::SeqCst);
>  
> -    info!(
> -        "downloaded {} ({}/s)",
> -        HumanByte::from(bytes),
> -        HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
> -    );
> +    log_sender
> +        .log(
> +            Level::INFO,
> +            format!(
> +                "{archive_prefix}: downloaded {} ({}/s)",
> +                HumanByte::from(bytes),
> +                HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
> +            ),
> +        )
> +        .await?;
>  
>      Ok(SyncStats {
>          chunk_count,
> @@ -292,6 +300,7 @@ async fn pull_single_archive<'a>(
>      archive_info: &'a FileInfo,
>      encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>      backend: &DatastoreBackend,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
>      let archive_name = &archive_info.filename;
>      let mut path = snapshot.full_path();
> @@ -302,72 +311,104 @@ async fn pull_single_archive<'a>(
>  
>      let mut sync_stats = SyncStats::default();
>  
> -    info!("sync archive {archive_name}");
> +    let archive_prefix = format!("{}: {archive_name}", snapshot.backup_time_string());
> +
> +    log_sender
> +        .log(Level::INFO, format!("{archive_prefix}: sync archive"))
> +        .await?;
>  
> -    reader.load_file_into(archive_name, &tmp_path).await?;
> +    reader
> +        .load_file_into(archive_name, &tmp_path)
> +        .await
> +        .with_context(|| archive_prefix.clone())?;
>  
> -    let mut tmpfile = std::fs::OpenOptions::new().read(true).open(&tmp_path)?;
> +    let mut tmpfile = std::fs::OpenOptions::new()
> +        .read(true)
> +        .open(&tmp_path)
> +        .with_context(|| archive_prefix.clone())?;
>  
>      match ArchiveType::from_path(archive_name)? {
>          ArchiveType::DynamicIndex => {
>              let index = DynamicIndexReader::new(tmpfile).map_err(|err| {
> -                format_err!("unable to read dynamic index {:?} - {}", tmp_path, err)
> +                format_err!("{archive_prefix}: unable to read dynamic index {tmp_path:?} - {err}")
>              })?;
>              let (csum, size) = index.compute_csum();
> -            verify_archive(archive_info, &csum, size)?;
> +            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
>  
>              if reader.skip_chunk_sync(snapshot.datastore().name()) {
> -                info!("skipping chunk sync for same datastore");
> +                log_sender
> +                    .log(
> +                        Level::INFO,
> +                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
> +                    )
> +                    .await?;
>              } else {
>                  let stats = pull_index_chunks(
>                      reader
>                          .chunk_reader(archive_info.crypt_mode)
> -                        .context("failed to get chunk reader")?,
> +                        .context("failed to get chunk reader")
> +                        .with_context(|| archive_prefix.clone())?,
>                      snapshot.datastore().clone(),
>                      index,
>                      encountered_chunks,
>                      backend,
> +                    &archive_prefix,
> +                    Arc::clone(&log_sender),
>                  )
> -                .await?;
> +                .await
> +                .with_context(|| archive_prefix.clone())?;
>                  sync_stats.add(stats);
>              }
>          }
>          ArchiveType::FixedIndex => {
>              let index = FixedIndexReader::new(tmpfile).map_err(|err| {
> -                format_err!("unable to read fixed index '{:?}' - {}", tmp_path, err)
> +                format_err!("{archive_name}: unable to read fixed index '{tmp_path:?}' - {err}")
>              })?;
>              let (csum, size) = index.compute_csum();
> -            verify_archive(archive_info, &csum, size)?;
> +            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
>  
>              if reader.skip_chunk_sync(snapshot.datastore().name()) {
> -                info!("skipping chunk sync for same datastore");
> +                log_sender
> +                    .log(
> +                        Level::INFO,
> +                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
> +                    )
> +                    .await?;
>              } else {
>                  let stats = pull_index_chunks(
>                      reader
>                          .chunk_reader(archive_info.crypt_mode)
> -                        .context("failed to get chunk reader")?,
> +                        .context("failed to get chunk reader")
> +                        .with_context(|| archive_prefix.clone())?,
>                      snapshot.datastore().clone(),
>                      index,
>                      encountered_chunks,
>                      backend,
> +                    &archive_prefix,
> +                    Arc::clone(&log_sender),
>                  )
> -                .await?;
> +                .await
> +                .with_context(|| archive_prefix.clone())?;
>                  sync_stats.add(stats);
>              }
>          }
>          ArchiveType::Blob => {
> -            tmpfile.rewind()?;
> -            let (csum, size) = sha256(&mut tmpfile)?;
> -            verify_archive(archive_info, &csum, size)?;
> +            proxmox_lang::try_block!({
> +                tmpfile.rewind()?;
> +                let (csum, size) = sha256(&mut tmpfile)?;
> +                verify_archive(archive_info, &csum, size)
> +            })
> +            .with_context(|| archive_prefix.clone())?;
>          }
>      }
>      if let Err(err) = std::fs::rename(&tmp_path, &path) {
> -        bail!("Atomic rename file {:?} failed - {}", path, err);
> +        bail!("{archive_prefix}: Atomic rename file {path:?} failed - {err}");
>      }
>  
>      backend
>          .upload_index_to_backend(snapshot, archive_name)
> -        .await?;
> +        .await
> +        .with_context(|| archive_prefix.clone())?;
>  
>      Ok(sync_stats)
>  }
> @@ -388,13 +429,24 @@ async fn pull_snapshot<'a>(
>      encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>      corrupt: bool,
>      is_new: bool,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
> +    let prefix = snapshot.backup_time_string().to_owned();
>      if is_new {
> -        info!("sync snapshot {}", snapshot.dir());
> +        log_sender
> +            .log(Level::INFO, format!("{prefix}: start sync"))
> +            .await?;
>      } else if corrupt {
> -        info!("re-sync snapshot {} due to corruption", snapshot.dir());
> +        log_sender
> +            .log(
> +                Level::INFO,
> +                format!("re-sync snapshot {prefix} due to corruption"),
> +            )
> +            .await?;
>      } else {
> -        info!("re-sync snapshot {}", snapshot.dir());
> +        log_sender
> +            .log(Level::INFO, format!("re-sync snapshot {prefix}"))
> +            .await?;
>      }
>  
>      let mut sync_stats = SyncStats::default();
> @@ -409,7 +461,8 @@ async fn pull_snapshot<'a>(
>      let tmp_manifest_blob;
>      if let Some(data) = reader
>          .load_file_into(MANIFEST_BLOB_NAME.as_ref(), &tmp_manifest_name)
> -        .await?
> +        .await
> +        .with_context(|| prefix.clone())?
>      {
>          tmp_manifest_blob = data;
>      } else {
> @@ -419,28 +472,34 @@ async fn pull_snapshot<'a>(
>      if manifest_name.exists() && !corrupt {
>          let manifest_blob = proxmox_lang::try_block!({
>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
> -                format_err!("unable to open local manifest {manifest_name:?} - {err}")
> +                format_err!("{prefix}: unable to open local manifest {manifest_name:?} - {err}")
>              })?;
>  
> -            let manifest_blob = DataBlob::load_from_reader(&mut manifest_file)?;
> +            let manifest_blob =
> +                DataBlob::load_from_reader(&mut manifest_file).with_context(|| prefix.clone())?;
>              Ok(manifest_blob)
>          })
>          .map_err(|err: Error| {
> -            format_err!("unable to read local manifest {manifest_name:?} - {err}")
> +            format_err!("{prefix}: unable to read local manifest {manifest_name:?} - {err}")
>          })?;
>  
>          if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>              if !client_log_name.exists() {
> -                reader.try_download_client_log(&client_log_name).await?;
> +                reader
> +                    .try_download_client_log(&client_log_name)
> +                    .await
> +                    .with_context(|| prefix.clone())?;
>              };
> -            info!("no data changes");
> +            log_sender
> +                .log(Level::INFO, format!("{prefix}: no data changes"))
> +                .await?;
>              let _ = std::fs::remove_file(&tmp_manifest_name);
>              return Ok(sync_stats); // nothing changed
>          }
>      }
>  
>      let manifest_data = tmp_manifest_blob.raw_data().to_vec();
> -    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
> +    let manifest = BackupManifest::try_from(tmp_manifest_blob).with_context(|| prefix.clone())?;
>  
>      if ignore_not_verified_or_encrypted(
>          &manifest,
> @@ -464,35 +523,54 @@ async fn pull_snapshot<'a>(
>          path.push(&item.filename);
>  
>          if !corrupt && path.exists() {
> -            let filename: BackupArchiveName = item.filename.as_str().try_into()?;
> +            let filename: BackupArchiveName = item
> +                .filename
> +                .as_str()
> +                .try_into()
> +                .with_context(|| prefix.clone())?;
>              match filename.archive_type() {
>                  ArchiveType::DynamicIndex => {
> -                    let index = DynamicIndexReader::open(&path)?;
> +                    let index = DynamicIndexReader::open(&path).with_context(|| prefix.clone())?;
>                      let (csum, size) = index.compute_csum();
>                      match manifest.verify_file(&filename, &csum, size) {
>                          Ok(_) => continue,
>                          Err(err) => {
> -                            info!("detected changed file {path:?} - {err}");
> +                            log_sender
> +                                .log(
> +                                    Level::INFO,
> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
> +                                )
> +                                .await?;
>                          }
>                      }
>                  }
>                  ArchiveType::FixedIndex => {
> -                    let index = FixedIndexReader::open(&path)?;
> +                    let index = FixedIndexReader::open(&path).with_context(|| prefix.clone())?;
>                      let (csum, size) = index.compute_csum();
>                      match manifest.verify_file(&filename, &csum, size) {
>                          Ok(_) => continue,
>                          Err(err) => {
> -                            info!("detected changed file {path:?} - {err}");
> +                            log_sender
> +                                .log(
> +                                    Level::INFO,
> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
> +                                )
> +                                .await?;
>                          }
>                      }
>                  }
>                  ArchiveType::Blob => {
> -                    let mut tmpfile = std::fs::File::open(&path)?;
> -                    let (csum, size) = sha256(&mut tmpfile)?;
> +                    let mut tmpfile = std::fs::File::open(&path).with_context(|| prefix.clone())?;
> +                    let (csum, size) = sha256(&mut tmpfile).with_context(|| prefix.clone())?;
>                      match manifest.verify_file(&filename, &csum, size) {
>                          Ok(_) => continue,
>                          Err(err) => {
> -                            info!("detected changed file {path:?} - {err}");
> +                            log_sender
> +                                .log(
> +                                    Level::INFO,
> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
> +                                )
> +                                .await?;
>                          }
>                      }
>                  }
> @@ -505,13 +583,14 @@ async fn pull_snapshot<'a>(
>              item,
>              encountered_chunks.clone(),
>              backend,
> +            Arc::clone(&log_sender),
>          )
>          .await?;
>          sync_stats.add(stats);
>      }
>  
>      if let Err(err) = std::fs::rename(&tmp_manifest_name, &manifest_name) {
> -        bail!("Atomic rename file {:?} failed - {}", manifest_name, err);
> +        bail!("{prefix}: Atomic rename file {manifest_name:?} failed - {err}");
>      }
>      if let DatastoreBackend::S3(s3_client) = backend {
>          let object_key = pbs_datastore::s3::object_key_from_path(
> @@ -524,33 +603,40 @@ async fn pull_snapshot<'a>(
>          let _is_duplicate = s3_client
>              .upload_replace_with_retry(object_key, data)
>              .await
> -            .context("failed to upload manifest to s3 backend")?;
> +            .context("failed to upload manifest to s3 backend")
> +            .with_context(|| prefix.clone())?;
>      }
>  
>      if !client_log_name.exists() {
> -        reader.try_download_client_log(&client_log_name).await?;
> +        reader
> +            .try_download_client_log(&client_log_name)
> +            .await
> +            .with_context(|| prefix.clone())?;
>          if client_log_name.exists() {
>              if let DatastoreBackend::S3(s3_client) = backend {
>                  let object_key = pbs_datastore::s3::object_key_from_path(
>                      &snapshot.relative_path(),
>                      CLIENT_LOG_BLOB_NAME.as_ref(),
>                  )
> -                .context("invalid archive object key")?;
> +                .context("invalid archive object key")
> +                .with_context(|| prefix.clone())?;
>  
>                  let data = tokio::fs::read(&client_log_name)
>                      .await
> -                    .context("failed to read log file contents")?;
> +                    .context("failed to read log file contents")
> +                    .with_context(|| prefix.clone())?;
>                  let contents = hyper::body::Bytes::from(data);
>                  let _is_duplicate = s3_client
>                      .upload_replace_with_retry(object_key, contents)
>                      .await
> -                    .context("failed to upload client log to s3 backend")?;
> +                    .context("failed to upload client log to s3 backend")
> +                    .with_context(|| prefix.clone())?;
>              }
>          }
>      };
>      snapshot
>          .cleanup_unreferenced_files(&manifest)
> -        .map_err(|err| format_err!("failed to cleanup unreferenced files - {err}"))?;
> +        .map_err(|err| format_err!("{prefix}: failed to cleanup unreferenced files - {err}"))?;
>  
>      Ok(sync_stats)
>  }
> @@ -565,10 +651,14 @@ async fn pull_snapshot_from<'a>(
>      snapshot: &'a pbs_datastore::BackupDir,
>      encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>      corrupt: bool,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
> +    let prefix = format!("{}", snapshot.backup_time_string());
> +
>      let (_path, is_new, _snap_lock) = snapshot
>          .datastore()
> -        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())?;
> +        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())
> +        .context(prefix.clone())?;
>  
>      let result = pull_snapshot(
>          params,
> @@ -577,6 +667,7 @@ async fn pull_snapshot_from<'a>(
>          encountered_chunks,
>          corrupt,
>          is_new,
> +        Arc::clone(&log_sender),
>      )
>      .await;
>  
> @@ -589,11 +680,20 @@ async fn pull_snapshot_from<'a>(
>                      snapshot.as_ref(),
>                      true,
>                  ) {
> -                    info!("cleanup error - {cleanup_err}");
> +                    log_sender
> +                        .log(
> +                            Level::INFO,
> +                            format!("{prefix}: cleanup error - {cleanup_err}"),
> +                        )
> +                        .await?;
>                  }
>                  return Err(err);
>              }
> -            Ok(_) => info!("sync snapshot {} done", snapshot.dir()),
> +            Ok(_) => {
> +                log_sender
> +                    .log(Level::INFO, format!("{prefix}: sync done"))
> +                    .await?
> +            }
>          }
>      }
>  
> @@ -622,7 +722,9 @@ async fn pull_group(
>      source_namespace: &BackupNamespace,
>      group: &BackupGroup,
>      shared_group_progress: Arc<SharedGroupProgress>,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
> +    let prefix = format!("{group}");
>      let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
>      let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
>  
> @@ -714,11 +816,15 @@ async fn pull_group(
>          .collect();
>  
>      if already_synced_skip_info.count > 0 {
> -        info!("{already_synced_skip_info}");
> +        log_sender
> +            .log(Level::INFO, format!("{prefix}: {already_synced_skip_info}"))
> +            .await?;
>          already_synced_skip_info.reset();
>      }
>      if transfer_last_skip_info.count > 0 {
> -        info!("{transfer_last_skip_info}");
> +        log_sender
> +            .log(Level::INFO, format!("{prefix}: {transfer_last_skip_info}"))
> +            .await?;
>          transfer_last_skip_info.reset();
>      }
>  
> @@ -730,8 +836,8 @@ async fn pull_group(
>          .store
>          .backup_group(target_ns.clone(), group.clone());
>      if let Some(info) = backup_group.last_backup(true).unwrap_or(None) {
> -        let mut reusable_chunks = encountered_chunks.lock().unwrap();
>          if let Err(err) = proxmox_lang::try_block!({
> +            let mut reusable_chunks = encountered_chunks.lock().unwrap();
>              let _snapshot_guard = info
>                  .backup_dir
>                  .lock_shared()
> @@ -780,7 +886,12 @@ async fn pull_group(
>              }
>              Ok::<(), Error>(())
>          }) {
> -            warn!("Failed to collect reusable chunk from last backup: {err:#?}");
> +            log_sender
> +                .log(
> +                    Level::WARN,
> +                    format!("Failed to collect reusable chunk from last backup: {err:#?}"),
> +                )
> +                .await?;
>          }
>      }
>  
> @@ -805,13 +916,16 @@ async fn pull_group(
>              &to_snapshot,
>              encountered_chunks.clone(),
>              corrupt,
> +            Arc::clone(&log_sender),
>          )
>          .await;
>  
>          // Update done groups progress by other parallel running pulls
>          local_progress.done_groups = shared_group_progress.load_done();
>          local_progress.done_snapshots = pos as u64 + 1;
> -        info!("percentage done: group {group}: {local_progress}");
> +        log_sender
> +            .log(Level::INFO, format!("percentage done: {local_progress}"))
> +            .await?;
>  
>          let stats = result?; // stop on error
>          sync_stats.add(stats);
> @@ -829,13 +943,23 @@ async fn pull_group(
>                  continue;
>              }
>              if snapshot.is_protected() {
> -                info!(
> -                    "don't delete vanished snapshot {} (protected)",
> -                    snapshot.dir()
> -                );
> +                log_sender
> +                    .log(
> +                        Level::INFO,
> +                        format!(
> +                            "{prefix}: don't delete vanished snapshot {} (protected)",
> +                            snapshot.dir(),
> +                        ),
> +                    )
> +                    .await?;
>                  continue;
>              }
> -            info!("delete vanished snapshot {}", snapshot.dir());
> +            log_sender
> +                .log(
> +                    Level::INFO,
> +                    format!("delete vanished snapshot {}", snapshot.dir()),
> +                )
> +                .await?;
>              params
>                  .target
>                  .store
> @@ -1035,10 +1159,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
>              }
>              Err(err) => {
>                  errors = true;
> -                info!(
> -                    "Encountered errors while syncing namespace {} - {err}",
> -                    &namespace,
> -                );
> +                info!("Encountered errors while syncing namespace {namespace} - {err}");
>              }
>          };
>      }
> @@ -1064,6 +1185,7 @@ async fn lock_and_pull_group(
>      namespace: &BackupNamespace,
>      target_namespace: &BackupNamespace,
>      shared_group_progress: Arc<SharedGroupProgress>,
> +    log_sender: Arc<LogLineSender>,
>  ) -> Result<SyncStats, Error> {
>      let (owner, _lock_guard) =
>          match params
> @@ -1073,25 +1195,47 @@ async fn lock_and_pull_group(
>          {
>              Ok(res) => res,
>              Err(err) => {
> -                info!("sync group {group} failed - group lock failed: {err}");
> -                info!("create_locked_backup_group failed");
> +                log_sender
> +                    .log(
> +                        Level::INFO,
> +                        format!("sync group {group} failed - group lock failed: {err}"),
> +                    )
> +                    .await?;
> +                log_sender
> +                    .log(Level::INFO, "create_locked_backup_group failed".to_string())
> +                    .await?;
>                  return Err(err);
>              }
>          };
>  
>      if params.owner != owner {
>          // only the owner is allowed to create additional snapshots
> -        info!(
> -            "sync group {group} failed - owner check failed ({} != {owner})",
> -            params.owner
> -        );
> +        log_sender
> +            .log(
> +                Level::INFO,
> +                format!(
> +                    "sync group {group} failed - owner check failed ({} != {owner})",
> +                    params.owner,
> +                ),
> +            )
> +            .await?;
>          return Err(format_err!("owner check failed"));
>      }
>  
> -    match pull_group(params, namespace, group, shared_group_progress).await {
> +    match pull_group(
> +        params,
> +        namespace,
> +        group,
> +        shared_group_progress,
> +        Arc::clone(&log_sender),
> +    )
> +    .await
> +    {
>          Ok(stats) => Ok(stats),
>          Err(err) => {
> -            info!("sync group {group} failed - {err:#}");
> +            log_sender
> +                .log(Level::INFO, format!("sync group {group} failed - {err:#}"))
> +                .await?;
>              Err(err)
>          }
>      }
> @@ -1124,7 +1268,7 @@ async fn pull_ns(
>      list.sort_unstable();
>  
>      info!(
> -        "found {} groups to sync (out of {unfiltered_count} total)",
> +        "Found {} groups to sync (out of {unfiltered_count} total)",
>          list.len()
>      );
>  
> @@ -1143,6 +1287,10 @@ async fn pull_ns(
>      let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
>      let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
>  
> +    let (buffered_logger, sender_builder) = BufferedLogger::new(5, Duration::from_secs(1));
> +    // runs until sender_builder and all senders build from it are being dropped
> +    buffered_logger.run_log_collection();
> +
>      let mut process_results = |results| {
>          for result in results {
>              match result {
> @@ -1160,16 +1308,20 @@ async fn pull_ns(
>          let target_ns = target_ns.clone();
>          let params = Arc::clone(&params);
>          let group_progress_cloned = Arc::clone(&shared_group_progress);
> +        let log_sender = Arc::new(sender_builder.sender_with_label(group.to_string()));
>          let results = group_workers
>              .spawn_task(async move {
> -                lock_and_pull_group(
> +                let result = lock_and_pull_group(
>                      Arc::clone(&params),
>                      &group,
>                      &namespace,
>                      &target_ns,
>                      group_progress_cloned,
> +                    Arc::clone(&log_sender),
>                  )
> -                .await
> +                .await;
> +                let _ = log_sender.flush().await;
> +                result
>              })
>              .await
>              .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
> @@ -1197,7 +1349,7 @@ async fn pull_ns(
>                  if !local_group.apply_filters(&params.group_filter) {
>                      continue;
>                  }
> -                info!("delete vanished group '{local_group}'");
> +                info!("Delete vanished group '{local_group}'");
>                  let delete_stats_result = params
>                      .target
>                      .store
> @@ -1206,7 +1358,7 @@ async fn pull_ns(
>                  match delete_stats_result {
>                      Ok(stats) => {
>                          if !stats.all_removed() {
> -                            info!("kept some protected snapshots of group '{local_group}'");
> +                            info!("Kept some protected snapshots of group '{local_group}'");
>                              sync_stats.add(SyncStats::from(RemovedVanishedStats {
>                                  snapshots: stats.removed_snapshots(),
>                                  groups: 0,
> @@ -1229,7 +1381,7 @@ async fn pull_ns(
>              Ok(())
>          });
>          if let Err(err) = result {
> -            info!("error during cleanup: {err}");
> +            info!("Error during cleanup: {err}");
>              errors = true;
>          };
>      }
> diff --git a/src/server/sync.rs b/src/server/sync.rs
> index e88418442..17ed4839f 100644
> --- a/src/server/sync.rs
> +++ b/src/server/sync.rs
> @@ -13,7 +13,6 @@ use futures::{future::FutureExt, select};
>  use hyper::http::StatusCode;
>  use pbs_config::BackupLockGuard;
>  use serde_json::json;
> -use tokio::task::JoinSet;
>  use tracing::{info, warn};
>  
>  use proxmox_human_byte::HumanByte;
> @@ -136,13 +135,13 @@ impl SyncSourceReader for RemoteSourceReader {
>                  Some(HttpError { code, message }) => match *code {
>                      StatusCode::NOT_FOUND => {
>                          info!(
> -                            "skipping snapshot {} - vanished since start of sync",
> +                            "Snapshot {}: skipped because vanished since start of sync",
>                              &self.dir
>                          );
>                          return Ok(None);
>                      }
>                      _ => {
> -                        bail!("HTTP error {code} - {message}");
> +                        bail!("Snapshot {}: HTTP error {code} - {message}", &self.dir);
>                      }
>                  },
>                  None => {
> @@ -176,7 +175,8 @@ impl SyncSourceReader for RemoteSourceReader {
>                  bail!("Atomic rename file {to_path:?} failed - {err}");
>              }
>              info!(
> -                "got backup log file {client_log_name}",
> +                "Snapshot {snapshot}: got backup log file {client_log_name}",
> +                snapshot = &self.dir,
>                  client_log_name = client_log_name.deref()
>              );
>          }
> -- 
> 2.47.3
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync Christian Ebner
@ 2026-04-20 12:29   ` Fabian Grünbichler
  0 siblings, 0 replies; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-20 12:29 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 17, 2026 11:26 am, Christian Ebner wrote:
> Currently, the logical size of the uploaded chunks is used for size
> and upload rate calculation in case of sync jobs in push direction,
> leading to inflated values for the transferred size and rate.
> 
> Use the compressed chunk size instead. To get the required
> information, return the more verbose `UploadStats` on
> `upload_index_chunk_info` calls and use it's compressed size for the
> transferred `bytes` of `SyncStats` instead. Since `UploadStats` is
> now part of a pub api, increase it's scope as well.
> 
> This is then finally being used to display the upload size and
> calculate the rate for the push sync job.

this would make it inconsistent with the logging when doing a regular
backup though:

> linux.mpxar: had to backup 13.587 MiB of 13.587 MiB (compressed 2.784 MiB) in 2.13 s (average 6.392 MiB/s)

from backup_writer.rs:482:

```
        let size_dirty = upload_stats.size - upload_stats.size_reused;
[..]
            let speed: HumanByte =
                ((size_dirty * 1_000_000) / (upload_stats.duration.as_micros() as usize)).into();
            let size_dirty: HumanByte = size_dirty.into();
            let size_compressed: HumanByte = upload_stats.size_compressed.into();
            info!(
                "{archive}: had to backup {size_dirty} of {size} (compressed {size_compressed}) in {:.2} s (average {speed}/s)",
                upload_stats.duration.as_secs_f64()
            );
```

I think if we adapt this, we should adapt it to either that calculation
or unify them?

> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 5:
> - no changes
> 
>  pbs-client/src/backup_stats.rs  | 20 ++++++++++----------
>  pbs-client/src/backup_writer.rs |  4 ++--
>  src/server/push.rs              |  4 ++--
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/pbs-client/src/backup_stats.rs b/pbs-client/src/backup_stats.rs
> index f0563a001..edf7ef3c4 100644
> --- a/pbs-client/src/backup_stats.rs
> +++ b/pbs-client/src/backup_stats.rs
> @@ -15,16 +15,16 @@ pub struct BackupStats {
>  }
>  
>  /// Extended backup run statistics and archive checksum
> -pub(crate) struct UploadStats {
> -    pub(crate) chunk_count: usize,
> -    pub(crate) chunk_reused: usize,
> -    pub(crate) chunk_injected: usize,
> -    pub(crate) size: usize,
> -    pub(crate) size_reused: usize,
> -    pub(crate) size_injected: usize,
> -    pub(crate) size_compressed: usize,
> -    pub(crate) duration: Duration,
> -    pub(crate) csum: [u8; 32],
> +pub struct UploadStats {
> +    pub chunk_count: usize,
> +    pub chunk_reused: usize,
> +    pub chunk_injected: usize,
> +    pub size: usize,
> +    pub size_reused: usize,
> +    pub size_injected: usize,
> +    pub size_compressed: usize,
> +    pub duration: Duration,
> +    pub csum: [u8; 32],
>  }
>  
>  impl UploadStats {
> diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
> index 49aff3fdd..4a4391c8b 100644
> --- a/pbs-client/src/backup_writer.rs
> +++ b/pbs-client/src/backup_writer.rs
> @@ -309,7 +309,7 @@ impl BackupWriter {
>          archive_name: &BackupArchiveName,
>          stream: impl Stream<Item = Result<MergedChunkInfo, Error>>,
>          options: UploadOptions,
> -    ) -> Result<BackupStats, Error> {
> +    ) -> Result<UploadStats, Error> {
>          let mut param = json!({ "archive-name": archive_name });
>          let (prefix, archive_size) = options.index_type.to_prefix_and_size();
>          if let Some(size) = archive_size {
> @@ -391,7 +391,7 @@ impl BackupWriter {
>              .post(&format!("{prefix}_close"), Some(param))
>              .await?;
>  
> -        Ok(upload_stats.to_backup_stats())
> +        Ok(upload_stats)
>      }
>  
>      pub async fn upload_stream(
> diff --git a/src/server/push.rs b/src/server/push.rs
> index 697b94f2f..494e0fbce 100644
> --- a/src/server/push.rs
> +++ b/src/server/push.rs
> @@ -1059,8 +1059,8 @@ async fn push_index(
>          .await?;
>  
>      Ok(SyncStats {
> -        chunk_count: upload_stats.chunk_count as usize,
> -        bytes: upload_stats.size as usize,
> +        chunk_count: upload_stats.chunk_count,
> +        bytes: upload_stats.size_compressed,
>          elapsed: upload_stats.duration,
>          removed: None,
>      })
> -- 
> 2.47.3
> 

* Re: [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (14 preceding siblings ...)
  2026-04-17  9:26 ` [PATCH proxmox-backup v6 15/15] ui: expose group worker setting in sync job edit window Christian Ebner
@ 2026-04-20 12:33 ` Fabian Grünbichler
  2026-04-21 10:28 ` superseded: " Christian Ebner
  16 siblings, 0 replies; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-20 12:33 UTC (permalink / raw)
  To: pbs-devel, Christian Ebner

On Fri, 17 Apr 2026 11:26:06 +0200, Christian Ebner wrote:
> Syncing contents from/to a remote source via a sync job suffers from
> low throughput on high latency networks because of limitations by the
> HTTP/2 connection, as described in [0]. To improve, syncing multiple
> groups in parallel by establishing multiple reader instances has been
> suggested.
> 
> This patch series implements the functionality by adding the sync job
> configuration property `worker-threads`, allowing to define the
> number of groups pull/push tokio tasks to be executed in parallel on
> the runtime during each job.
> 
> [...]

Applied some of the preparatory patches, thanks!

[02/15] tools: group and sort module imports
        commit: c7fe846a0f0c92594ce91f66e76361f9a2b2c1d3
[07/15] sync: pull: revert avoiding reinstantiation for encountered chunks map
        commit: 173059e31a9dbc7f463019cc92ade864d02af6ef
[08/15] sync: pull: factor out backup group locking and owner check
        commit: 5e1e7bb0de424af1c6466dc750882cd11d9b1675
[09/15] sync: pull: prepare pull parameters to be shared across parallel tasks
        commit: 9213b1ed3fd9f5469b78cc2934f1dcebf13c8fc3
[12/15] sync: push: prepare push parameters to be shared across parallel tasks
        commit: f2843b84f9d2219ff7213691af7f263133a4b039

left the rest out for now, as at least the logging part still needs some
improvements/rework to be really functional, and without functional logging
enabling parallelism is a mess.

I don't think there is that much missing anymore to pull this across the line
though.

Best regards,
-- 
Fabian Grünbichler <f.gruenbichler@proxmox.com>





* Re: [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages
  2026-04-20 10:57   ` Fabian Grünbichler
@ 2026-04-20 17:15     ` Christian Ebner
  2026-04-21  6:49       ` Fabian Grünbichler
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-20 17:15 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/20/26 12:56 PM, Fabian Grünbichler wrote:
> On April 17, 2026 11:26 am, Christian Ebner wrote:
>> Implements a buffered logger instance which collects messages send
>> from different sender instances via an async tokio channel and
>> buffers them. Senders identify by label and provide a log level for
>> each log line to be buffered and flushed.
>>
>> On collection, log lines are grouped by label and buffered in
>> sequence of arrival per label, up to the configured maximum number of
>> lines per group, or periodically with the configured interval. The
>> interval timeout is reset when contents are flushed. In addition,
>> senders can request flushing at any given point.
>>
>> When the timeout set based on the interval is reached, all labels
>> log buffers are flushed. There is no guarantee on the order of labels
>> when flushing.
>>
>> Log output is written based on provided log line level and prefixed
>> by the label.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 5:
>> - not present in previous version
>>
>>   pbs-tools/Cargo.toml             |   2 +
>>   pbs-tools/src/buffered_logger.rs | 216 +++++++++++++++++++++++++++++++
>>   pbs-tools/src/lib.rs             |   1 +
>>   3 files changed, 219 insertions(+)
>>   create mode 100644 pbs-tools/src/buffered_logger.rs
>>
>> diff --git a/pbs-tools/Cargo.toml b/pbs-tools/Cargo.toml
>> index 998e3077e..6b1d92fa6 100644
>> --- a/pbs-tools/Cargo.toml
>> +++ b/pbs-tools/Cargo.toml
>> @@ -17,10 +17,12 @@ openssl.workspace = true
>>   serde_json.workspace = true
>>   # rt-multi-thread is required for block_in_place
>>   tokio = { workspace = true, features = [ "fs", "io-util", "rt", "rt-multi-thread", "sync" ] }
>> +tracing.workspace = true
>>   
>>   proxmox-async.workspace = true
>>   proxmox-io = { workspace = true, features = [ "tokio" ] }
>>   proxmox-human-byte.workspace = true
>> +proxmox-log.workspace = true
>>   proxmox-sys.workspace = true
>>   proxmox-time.workspace = true
>>   
>> diff --git a/pbs-tools/src/buffered_logger.rs b/pbs-tools/src/buffered_logger.rs
>> new file mode 100644
>> index 000000000..39cf068cd
>> --- /dev/null
>> +++ b/pbs-tools/src/buffered_logger.rs
>> @@ -0,0 +1,216 @@
>> +//! Log aggregator to collect and group messages send from concurrent tasks via
> 
> nit: sen*t*
> 
>> +//! a tokio channel.
>> +
>> +use std::collections::hash_map::Entry;
>> +use std::collections::HashMap;
>> +use std::time::Duration;
>> +
>> +use anyhow::Error;
>> +use tokio::sync::mpsc;
>> +use tokio::time::{self, Instant};
>> +use tracing::{debug, error, info, trace, warn, Level};
>> +
>> +use proxmox_log::LogContext;
>> +
>> +/// Label to be used to group currently buffered messages when flushing.
> 
> I'd drop the currently here
> 
>> +pub type SenderLabel = String;
>> +
>> +/// Requested action for the log collection task
>> +enum SenderRequest {
>> +    // new log line to be buffered
>> +    Message(LogLine),
> 
> see below, I think this should have the label split out
> 
>> +    // flush currently buffered log lines associated by sender label
>> +    Flush(SenderLabel),
> 
> this is actually used at the moment to finish a particular label, maybe
> that should be made explicit (see below)
> 
>> +}
>> +
>> +/// Logger instance to buffer and group log output to keep concurrent logs readable
>> +///
>> +/// Receives the logs from an async input channel, buffers them grouped by input
>> +/// channel and flushes them after either reaching a timeout or capacity limit.
> 
> /// Receives log lines tagged with a label, and buffers them grouped
> /// by the label value. Buffered messages are flushed either after
> /// reaching a certain timeout or capacity limit, or when explicitly
> /// requested.
> 
>> +pub struct BufferedLogger {
>> +    // buffer to aggregate log lines based on sender label
>> +    buffer_map: HashMap<SenderLabel, Vec<LogLine>>,
> 
> do we expect this to be always used with the tiny limits we currently
> employ? if so, we might want to consider a different structure here that
> is more efficient/optimized for that use case?

Yes, at the moment I do not foresee any users of this that would 
require larger limits. What struct would you suggest though? A Vec is 
not great for removing items that are no longer needed, so a linked list?

> also note that this effectively duplicates the label once per line..

Yes, it is better to factor out the `SenderLabel` from the `LogLine`, 
and send new messages via `Message(SenderLabel, LogLine)` instead. 
Adapted for v7.

> 
>> +    // maximum number of received lines for an individual sender instance before
>> +    // flushing
>> +    max_buffered_lines: usize,
>> +    // maximum aggregation duration of received lines for an individual sender
>> +    // instance before flushing
>> +    max_aggregation_time: Duration,
>> +    // channel to receive log messages
>> +    receiver: mpsc::Receiver<SenderRequest>,
>> +}
>> +
>> +/// Instance to create new sender instances by cloning the channel sender
>> +pub struct LogLineSenderBuilder {
>> +    // to clone new senders if requested
>> +    _sender: mpsc::Sender<SenderRequest>,
> 
> nit: this should be called `sender`, it is used below even if just for
> cloning?

Acked, adapting this for v7.

> 
>> +}
>> +
>> +impl LogLineSenderBuilder {
>> +    /// Create new sender instance to send log messages, to be grouped by given label
>> +    ///
>> +    /// Label is not checked to be unique (no other instance with same label exists),
>> +    /// it is the callers responsibility to check so if required.
>> +    pub fn sender_with_label(&self, label: SenderLabel) -> LogLineSender {
>> +        LogLineSender {
>> +            label,
>> +            sender: self._sender.clone(),
>> +        }
>> +    }
>> +}
>> +
>> +/// Sender to publish new log messages to buffered log aggregator
> 
> this sender doesn't publish anything
> 
>> +pub struct LogLineSender {
>> +    // label used to group log lines
>> +    label: SenderLabel,
>> +    // sender to publish new log lines to buffered log aggregator task
>> +    sender: mpsc::Sender<SenderRequest>,
>> +}
>> +
>> +impl LogLineSender {
>> +    /// Send a new log message with given level to the buffered logger task
>> +    pub async fn log(&self, level: Level, message: String) -> Result<(), Error> {
>> +        let line = LogLine {
>> +            label: self.label.clone(),
>> +            level,
>> +            message,
>> +        };
>> +        self.sender.send(SenderRequest::Message(line)).await?;
>> +        Ok(())
>> +    }
>> +
>> +    /// Flush all messages with sender's label
> 
> Flush all *buffered* messages with *this* sender's label
> 
> ?
> 
>> +    pub async fn flush(&self) -> Result<(), Error> {
>> +        self.sender
>> +            .send(SenderRequest::Flush(self.label.clone()))
>> +            .await?;
>> +        Ok(())
>> +    }
>> +}
>> +
>> +/// Log message entity
>> +struct LogLine {
>> +    /// label indentifiying the sender
> 
> nit: typo, capitalization inconsistent
> 
>> +    label: SenderLabel,
>> +    /// Log level to use during flushing
> 
> Log level of the message
> 
> ?
> 
>> +    level: Level,
>> +    /// log line to be buffered and flushed
> 
> Log message
> 
> ?
> 
> buffering and flushing happens elsewhere..
> 
>> +    message: String,
>> +}
>> +
>> +impl BufferedLogger {
>> +    /// New instance of a buffered logger
>> +    pub fn new(
>> +        max_buffered_lines: usize,
>> +        max_aggregation_time: Duration,
>> +    ) -> (Self, LogLineSenderBuilder) {
>> +        let (_sender, receiver) = mpsc::channel(100);
> 
> nit: this should be called `sender`
> 
>> +
>> +        (
>> +            Self {
>> +                buffer_map: HashMap::new(),
>> +                max_buffered_lines,
>> +                max_aggregation_time,
>> +                receiver,
>> +            },
>> +            LogLineSenderBuilder { _sender },
>> +        )
>> +    }
>> +
>> +    /// Starts the collection loop spawned on a new tokio task
>> +    /// Finishes when all sender belonging to the channel have been dropped.
>> +    pub fn run_log_collection(mut self) {
>> +        let future = async move {
>> +            loop {
>> +                let deadline = Instant::now() + self.max_aggregation_time;
>> +                match time::timeout_at(deadline, self.receive_log_line()).await {
> 
> why manually calculate the deadline, wouldn't using `time::timeout` work
> as well? the only difference from a quick glance is that that one does a
> checked_add for now + delay..

No specific reason for using timeout_at() here, was primed by having 
based this on the s3 client timeout

> 
> but also, isn't this kind of broken in any case? let's say I have two
> labels A and B:
> 
>   0.99 A1
>   1.98 A2
>   2.97 A3
>   3.96 A4
>   4.95 A5 (now A is at capacity)
>   5.94 B1
>   9.90 B5 (now B is at capacity as well)
> 
> either
> 
> 10.90 timeout elapses, everything is flushed
> 
> or
> 
> 10.89 A6 (A gets flushed and can start over - but B hasn't been flushed)
> 11.88 A7
> 12.87 A8
> 13.86 A9
> 14.85 A10 (A has 5 buffered messages again)
> ..
> 
> this means that any label that doesn't log a 6th message can stall for
> quite a long time, as long as other labels make progress (and it isn't
> flushed explicitly)?

Yes, this is true, but that is not really avoidable unless there is a 
timeout per label. Or would you suggest to simply flush all buffered 
lines at periodic intervals, without resetting at all?


> 
>> +                    Ok(finished) => {
>> +                        if finished {
>> +                            break;
>> +                        }
>> +                    }
>> +                    Err(_timeout) => self.flush_all_buffered(),
>> +                }
>> +            }
>> +        };
>> +        match LogContext::current() {
>> +            None => tokio::spawn(future),
>> +            Some(context) => tokio::spawn(context.scope(future)),
>> +        };
>> +    }
>> +
>> +    /// Collects new log lines, buffers and flushes them if max lines limit exceeded.
>> +    ///
>> +    /// Returns `true` if all the senders have been dropped and the task should no
>> +    /// longer wait for new messages and finish.
>> +    async fn receive_log_line(&mut self) -> bool {
>> +        if let Some(request) = self.receiver.recv().await {
>> +            match request {
>> +                SenderRequest::Flush(label) => {
>> +                    if let Some(log_lines) = self.buffer_map.get_mut(&label) {
>> +                        Self::log_with_label(&label, log_lines);
>> +                        log_lines.clear();
>> +                    }
>> +                }
>> +                SenderRequest::Message(log_line) => {
> 
> if this would be Message((label, level, line)) or Message((label,
> level_and_line)) the label would not need to be stored in the buffer
> keys and values..

Yes, adapted based on above mention already

> 
>> +                    if self.max_buffered_lines == 0
>> +                        || self.max_aggregation_time < Duration::from_secs(0)
> 
> the timeout can never be below zero, as that is the minimum duration
> (duration is unsigned)?

This is a typo, the intention was to check for durations below 1 second, 
but since the granularity is seconds, this should check for 0 instead.

> 
>> +                    {
>> +                        // shortcut if no buffering should happen
>> +                        Self::log_by_level(&log_line.label, &log_line);
> 
> shouldn't we rather handle this by not using the buffered logger in the
> first place? e.g., have this and a simple not-buffering logger implement
> a shared logging trait, or something similar?

Hmm, that might be better, yes. Will add a trait with 2 implementations 
based on which logger is required.
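
Roughly along these lines (hypothetical sketch, all names made up; the 
`out` Vec just stands in for the real tracing output):

```rust
// Hypothetical shared trait for a buffering and a pass-through logger,
// so callers that disable buffering skip the channel/timeout machinery
// entirely.
trait GroupLogger {
    fn log(&mut self, label: &str, message: &str);
    fn flush(&mut self, label: &str);
}

// Pass-through implementation: emits immediately, flush is a no-op.
// The `out` Vec stands in for the tracing macros in this sketch.
struct DirectLogger {
    out: Vec<String>,
}

impl GroupLogger for DirectLogger {
    fn log(&mut self, label: &str, message: &str) {
        // emit immediately, prefixed by the label
        self.out.push(format!("[{label}]: {message}"));
    }

    fn flush(&mut self, _label: &str) {
        // nothing buffered, nothing to do
    }
}
```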

> 
> one simple approach would be to just make the LogLineSender log directly
> in this case, and not send anything at all?
> 
> because if we don't want buffering, sending all log messages through a
> channel and setting up the timeout machinery can be avoided completely..
> 
>> +                    }
>> +
>> +                    match self.buffer_map.entry(log_line.label.clone()) {
>> +                        Entry::Occupied(mut occupied) => {
>> +                            let log_lines = occupied.get_mut();
>> +                            if log_lines.len() + 1 > self.max_buffered_lines {
>> +                                // reached limit for this label,
>> +                                // flush all buffered and new log line
>> +                                Self::log_with_label(&log_line.label, log_lines);
>> +                                log_lines.clear();
>> +                                Self::log_by_level(&log_line.label, &log_line);
>> +                            } else {
>> +                                // below limit, push to buffer to flush later
>> +                                log_lines.push(log_line);
>> +                            }
>> +                        }
>> +                        Entry::Vacant(vacant) => {
>> +                            vacant.insert(vec![log_line]);
>> +                        }
>> +                    }
>> +                }
>> +            }
>> +            return false;
>> +        }
>> +
>> +        // no more senders, all LogLineSender's and LogLineSenderBuilder have been dropped
> 
> nit: typo `'s`
> 
>> +        self.flush_all_buffered();
>> +        true
>> +    }
>> +
>> +    /// Flush all currently buffered contents without ordering, but grouped by label
>> +    fn flush_all_buffered(&mut self) {
>> +        for (label, log_lines) in self.buffer_map.iter() {
>> +            Self::log_with_label(label, log_lines);
>> +        }
>> +        self.buffer_map.clear();
> 
> wouldn't it be better performance wise to
> - clear each label's log lines (like in SendRequest::Flush)
> - remove the hashmap entry in SendRequest::Flush, or rename that one to
>    finish, since that is what it actually does?
> 
> granted, this only triggers when the timeout elapses or there are no
> more senders, but for the timeout case it might still be beneficial? it
> should remove a lot of allocation churn at least..

Yes, true. Fixed that as well.

> 
>> +    }
>> +
>> +    /// Log given log lines prefixed by label
>> +    fn log_with_label(label: &str, log_lines: &[LogLine]) {
> 
> currently each LogLine contains the label anyway, but see above, I do
> think this split makes sense but it should be done completely ;)

yes, makes more sense, split off as suggested

> 
>> +        for log_line in log_lines {
>> +            Self::log_by_level(label, log_line);
>> +        }
>> +    }
>> +
>> +    /// Write the given log line prefixed by label
>> +    fn log_by_level(label: &str, log_line: &LogLine) {
> 
> this also logs with label, IMHO the naming is confusing..
> 
>> +        match log_line.level {
>> +            Level::ERROR => error!("[{label}]: {}", log_line.message),
>> +            Level::WARN => warn!("[{label}]: {}", log_line.message),
>> +            Level::INFO => info!("[{label}]: {}", log_line.message),
>> +            Level::DEBUG => debug!("[{label}]: {}", log_line.message),
>> +            Level::TRACE => trace!("[{label}]: {}", log_line.message),
>> +        }
>> +    }
>> +}
>> diff --git a/pbs-tools/src/lib.rs b/pbs-tools/src/lib.rs
>> index f41aef6df..1e3972c92 100644
>> --- a/pbs-tools/src/lib.rs
>> +++ b/pbs-tools/src/lib.rs
>> @@ -1,4 +1,5 @@
>>   pub mod async_lru_cache;
>> +pub mod buffered_logger;
>>   pub mod cert;
>>   pub mod crypt_config;
>>   pub mod format;
>> -- 
>> 2.47.3

* Re: [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages
  2026-04-20 17:15     ` Christian Ebner
@ 2026-04-21  6:49       ` Fabian Grünbichler
  0 siblings, 0 replies; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-21  6:49 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 20, 2026 7:15 pm, Christian Ebner wrote:
> On 4/20/26 12:56 PM, Fabian Grünbichler wrote:
>> On April 17, 2026 11:26 am, Christian Ebner wrote:
>>> Implements a buffered logger instance which collects messages send
>>> from different sender instances via an async tokio channel and
>>> buffers them. Sender identify by label and provide a log level for
>>> each log line to be buffered and flushed.
>>>
>>> On collection, log lines are grouped by label and buffered in
>>> sequence of arrival per label, up to the configured maximum number of
>>> per group lines or periodically with the configured interval. The
>>> interval timeout is reset when contents are flushed. In addition,
>>> senders can request flushing at any given point.
>>>
>>> When the timeout set based on the interval is reached, all labels
>>> log buffers are flushed. There is no guarantee on the order of labels
>>> when flushing.
>>>
>>> Log output is written based on provided log line level and prefixed
>>> by the label.
>>>
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>

> [..]

>>> +    /// Starts the collection loop spawned on a new tokio task
>>> +    /// Finishes when all sender belonging to the channel have been dropped.
>>> +    pub fn run_log_collection(mut self) {
>>> +        let future = async move {
>>> +            loop {
>>> +                let deadline = Instant::now() + self.max_aggregation_time;
>>> +                match time::timeout_at(deadline, self.receive_log_line()).await {
>> 
>> why manually calculate the deadline, wouldn't using `time::timeout` work
>> as well? the only difference from a quick glance is that that one does a
>> checked_add for now + delay..
> 
> No specific reason for using timeout_at() here, was primed by having 
> based this on the s3 client timeout

I guess it depends on how we tackle the issue below; for some approaches,
timeout_at with a deadline that is not reset on each iteration would also
be appropriate..

>> but also, isn't this kind of broken in any case? let's say I have two
>> labels A and B:
>> 
>>   0.99 A1
>>   1.98 A2
>>   2.97 A3
>>   3.96 A4
>>   4.95 A5 (now A is at capacity)
>>   5.94 B1
>>   9.90 B5 (now B is at capacity as well)
>> 
>> either
>> 
>> 10.90 timeout elapses, everything is flushed
>> 
>> or
>> 
>> 10.89 A6 (A gets flushed and can start over - but B hasn't been flushed)
>> 11.88 A7
>> 12.87 A8
>> 13.86 A9
>> 14.85 A10 (A has 5 buffered messages again)
>> ..
>> 
>> this means that any label that doesn't log a 6th message can stall for
>> quite a long time, as long as other labels make progress (and it isn't
>> flushed explicitly)?
> 
> Yes, this is true, but that is not really avoidable unless there is a 
> timeout per label. Or would you suggest to simply flush all buffered 
> lines at periodic intervals, without resetting at all?

yeah, I think we want to have a hard limit for the delay per label as
well, not just one for the delay-if-no-activity-at-all.

because the log timestamp is only added when emitting the log line to
the real logger, and if we delay too long the log doesn't reflect
reality anymore..

basically what we want to achieve here is that
- a single no-change sync should be emitted as one block of log lines,
  unless it takes unusually long
- a small burst of back-to-back log messages of the same group (e.g.,
  two warning lines) are emitted as one block, unless timing was really
  bad
- no individual log line should be delayed for more than N where N is
  rather small

that does mean we need to do some extra checking when handling the
messages coming in over the channel, because otherwise the channel
traffic could overwhelm the flushing logic and violate the third
property.
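
a rough std-only sketch of the per-label deadline bookkeeping I have in
mind (all names hypothetical, the channel/recv loop left out):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical per-label buffer: each label remembers when its oldest
// line arrived, so no line is delayed longer than `max_delay`, even
// while other labels keep hitting their capacity limit.
struct Buffers {
    map: HashMap<String, (Instant, Vec<String>)>,
    max_lines: usize,
    max_delay: Duration,
}

impl Buffers {
    // Buffer a line; return the label's lines if it reached capacity.
    fn push(&mut self, label: &str, line: String, now: Instant) -> Option<Vec<String>> {
        let (_, lines) = self
            .map
            .entry(label.to_string())
            .or_insert_with(|| (now, Vec::new()));
        lines.push(line);
        if lines.len() >= self.max_lines {
            return self.map.remove(label).map(|(_, lines)| lines);
        }
        None
    }

    // Earliest per-label deadline, usable as the timeout_at() argument
    // for the next channel recv().
    fn next_deadline(&self) -> Option<Instant> {
        self.map
            .values()
            .map(|(oldest, _)| *oldest + self.max_delay)
            .min()
    }

    // Flush every label whose oldest buffered line exceeded max_delay.
    fn flush_due(&mut self, now: Instant) -> Vec<(String, Vec<String>)> {
        let due: Vec<String> = self
            .map
            .iter()
            .filter(|(_, (oldest, _))| now.duration_since(*oldest) >= self.max_delay)
            .map(|(label, _)| label.clone())
            .collect();
        due.into_iter()
            .filter_map(|label| self.map.remove(&label).map(|(_, lines)| (label, lines)))
            .collect()
    }
}
```

that would give each label its own hard delay bound, independent of
whether other labels keep flushing on capacity.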

>>> +                    Ok(finished) => {
>>> +                        if finished {
>>> +                            break;
>>> +                        }
>>> +                    }
>>> +                    Err(_timeout) => self.flush_all_buffered(),
>>> +                }
>>> +            }
>>> +        };
>>> +        match LogContext::current() {
>>> +            None => tokio::spawn(future),
>>> +            Some(context) => tokio::spawn(context.scope(future)),
>>> +        };
>>> +    }
>>> +
>>> +    /// Collects new log lines, buffers and flushes them if max lines limit exceeded.
>>> +    ///
>>> +    /// Returns `true` if all the senders have been dropped and the task should no
>>> +    /// longer wait for new messages and finish.
>>> +    async fn receive_log_line(&mut self) -> bool {
>>> +        if let Some(request) = self.receiver.recv().await {
>>> +            match request {
>>> +                SenderRequest::Flush(label) => {
>>> +                    if let Some(log_lines) = self.buffer_map.get_mut(&label) {
>>> +                        Self::log_with_label(&label, log_lines);
>>> +                        log_lines.clear();
>>> +                    }
>>> +                }
>>> +                SenderRequest::Message(log_line) => {
>> 
>> if this would be Message((label, level, line)) or Message((label,
>> level_and_line)) the label would not need to be stored in the buffer
>> keys and values..
> 
> Yes, adapted based on above mention already
> 
>> 
>>> +                    if self.max_buffered_lines == 0
>>> +                        || self.max_aggregation_time < Duration::from_secs(0)
>> 
>> the timeout can never be below zero, as that is the minimum duration
>> (duration is unsigned)?
> 
> This is a typo, the intention was to check for durations below 1 second, 
> but since the granularity is seconds, this should check for 0 instead.

in that case, you can just call self.max_aggregation_time.is_zero() :)
though if we go with the trait/non-buffering implementation, this whole
check could go, since a buffered logger would never be instantiated with
buffering disabled?

>>> +                    {
>>> +                        // shortcut if no buffering should happen
>>> +                        Self::log_by_level(&log_line.label, &log_line);
>> 
>> shouldn't we rather handle this by not using the buffered logger in the
>> first place? e.g., have this and a simple not-buffering logger implement
>> a shared logging trait, or something similar?
> 
> Hmm, that might be better, yes. Will add a trait with 2 implementations 
> based on which logger is required.
> 
>> 
>> one simple approach would be to just make the LogLineSender log directly
>> in this case, and not send anything at all?
>> 
>> because if we don't want buffering, sending all log messages through a
>> channel and setting up the timeout machinery can be avoided completely..
>> 

> [..]


* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-20 11:56   ` Fabian Grünbichler
@ 2026-04-21  7:21     ` Christian Ebner
  2026-04-21  7:42       ` Christian Ebner
  2026-04-21 12:57     ` Thomas Lamprecht
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-21  7:21 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/20/26 1:54 PM, Fabian Grünbichler wrote:
> On April 17, 2026 11:26 am, Christian Ebner wrote:
>> Pulling groups and therefore also snapshots in parallel leads to
>> unordered log outputs, making it mostly impossible to relate a log
>> message to a backup snapshot/group.
>>
>> Therefore, prefix pull job log messages by the corresponding group or
>> snapshot and set the error context accordingly.
>>
>> Also, reword some messages, inline variables in format strings and
>> start log lines with capital letters to get consistent output.
>>
>> By using the buffered logger implementation and buffer up to 5 lines
>> with a timeout of 1 second, subsequent log lines arriving in fast
>> succession are kept together, reducing the mixing of lines.
>>
>> Example output for a sequential pull job:
>> ```
>> ...
>> [ct/100]: 2025-11-17T10:11:42Z: start sync
>> [ct/100]: 2025-11-17T10:11:42Z: pct.conf.blob: sync archive
>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: sync archive
>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: downloaded 16.785 MiB (280.791 MiB/s)
>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: sync archive
>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: downloaded 65.703 KiB (29.1 MiB/s)
>> [ct/100]: 2025-11-17T10:11:42Z: sync done
>> [ct/100]: percentage done: 9.09% (1/11 groups)
>> [ct/101]: 2026-03-31T12:20:16Z: start sync
>> [ct/101]: 2026-03-31T12:20:16Z: pct.conf.blob: sync archive
>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: sync archive
>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: downloaded 199.806 MiB (311.91 MiB/s)
>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: sync archive
>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: downloaded 180.379 KiB (22.748 MiB/s)
>> [ct/101]: 2026-03-31T12:20:16Z: sync done
> 
> this is probably the wrong patch for this comment, but since you have
> the sample output here ;)
> 
> should we pad the labels? the buffered logger knows all currently active
> labels and could adapt it? otherwise especially for host backups the
> logs are again not really scannable by humans, because there will be
> weird jumps in alignment.. or (.. continued below)

That will not work though? While the logger knows about the labels when 
receiving them, it does not know at all about future ones that might 
still come in. So it can happen that the padding does not fit anymore, 
and one gets even uglier formatting issues.

I would prefer to keep them as is for the time being. For vm/ct it is 
not that bad; one could maybe assume the ID to be a 4-digit number and 
define a minimum label length to pad to if not reached.

Host backups or backups with an explicit label can be of arbitrary 
length, however.
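
the minimum-width variant would be something like this (hypothetical 
helper, not part of the series):

```rust
// Hypothetical minimum-width padding for the label prefix: short
// labels like "ct/100" line up in a column, while longer host-backup
// labels just shift the column instead of being truncated.
fn prefixed(label: &str, message: &str, min_width: usize) -> String {
    format!("[{label:<min_width$}]: {message}")
}
```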


>> ...
>> ```
>>
>> Example output for a parallel pull job:
>> ```
>> ...
>> [ct/107]: 2025-07-16T09:14:01Z: start sync
>> [ct/107]: 2025-07-16T09:14:01Z: pct.conf.blob: sync archive
>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: sync archive
>> [vm/108]: 2025-09-19T07:37:19Z: start sync
>> [vm/108]: 2025-09-19T07:37:19Z: qemu-server.conf.blob: sync archive
>> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: sync archive
>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: downloaded 609.233 MiB (112.628 MiB/s)
>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: sync archive
>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: downloaded 1.172 MiB (17.838 MiB/s)
>> [ct/107]: 2025-07-16T09:14:01Z: sync done
>> [ct/107]: percentage done: 72.73% (8/11 groups)
> 
> the way the prefix and snapshot are formatted could also be interpreted
> at first glance as a timestamp of the log line.. why not just prepend
> the prefix on the logger side, and leave it up to the caller to do the
> formatting? then we could us "{type}/{id}/" as prefix here? or add
> 'snapshot' in those lines to make it clear? granted, this is more of an
> issue when viewing the log via `proxmox-backup-manager`, as in the UI we
> have the log timestamps up front..

I simply followed along the line of your suggestion here. In prior 
versions I had exactly that, but you rejected it as too verbose? :)

https://lore.proxmox.com/pbs-devel/1774263381.bngcrer2th.astroid@yuna.none/

> 
> and maybe(?) log the progress line using a different prefix? because
> right now the information that the group [ct/107] is finished is not
> really clear from the output, IMHO.

That I can add, yes.

> 
> the progress logging is also still broken (this is for a sync that takes
> a while, this is not log messages being buffered and re-ordered!):
> 
> $ proxmox-backup-manager task log 'UPID:yuna:00070656:001E420F:00000002:69E610EB:syncjob:local\x3atest\x3atank\x3a\x3as\x2dbc01cba6\x2d805a:root@pam:' | grep -e namespace -e 'percentage done'
> Syncing datastore 'test', root namespace into datastore 'tank', root namespace
> Finished syncing root namespace, current progress: 0 groups, 0 snapshots
> Syncing datastore 'test', namespace 'test' into datastore 'tank', namespace 'test'
> [host/exclusion-test]: percentage done: 5.26% (1/19 groups)
> [host/acltest]: percentage done: 5.26% (1/19 groups)
> [host/logtest]: percentage done: 5.26% (1/19 groups)
> [host/onemeg]: percentage done: 5.26% (1/19 groups)
> [host/fourmeg]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
> [host/symlink]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
> [host/symlink]: percentage done: 5.26% (1/19 groups)
> [host/fourmeg]: percentage done: 5.26% (1/19 groups)
> [host/format-v2-test]: percentage done: 1.75% (0/19 groups, 1/3 snapshots in group #1)
> [host/format-v2-test]: percentage done: 3.51% (0/19 groups, 2/3 snapshots in group #1)
> [host/format-v2-test]: percentage done: 5.26% (1/19 groups)
> [host/incrementaltest2]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
> [host/incrementaltest2]: percentage done: 5.26% (1/19 groups)

There will always be cases where it does not work, I guess; not sure what 
I can do to make you happy here.

> 
>> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: downloaded 1.196 GiB (156.892 MiB/s)
>> [vm/108]: 2025-09-19T07:37:19Z: sync done
>> ...
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 5:
>> - uses BufferedLogger implementation, refactored accordingly
>> - improve log line prefixes
>> - add missing error contexts
>>
>>   src/server/pull.rs | 314 +++++++++++++++++++++++++++++++++------------
>>   src/server/sync.rs |   8 +-
>>   2 files changed, 237 insertions(+), 85 deletions(-)
>>
>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>> index 611441d2a..f7aae4d59 100644
>> --- a/src/server/pull.rs
>> +++ b/src/server/pull.rs
>> @@ -5,11 +5,11 @@ use std::collections::{HashMap, HashSet};
>>   use std::io::Seek;
>>   use std::sync::atomic::{AtomicUsize, Ordering};
>>   use std::sync::{Arc, Mutex};
>> -use std::time::SystemTime;
>> +use std::time::{Duration, SystemTime};
>>   
>>   use anyhow::{bail, format_err, Context, Error};
>>   use proxmox_human_byte::HumanByte;
>> -use tracing::{info, warn};
>> +use tracing::{info, Level};
>>   
>>   use pbs_api_types::{
>>       print_store_and_ns, ArchiveType, Authid, BackupArchiveName, BackupDir, BackupGroup,
>> @@ -27,6 +27,7 @@ use pbs_datastore::manifest::{BackupManifest, FileInfo};
>>   use pbs_datastore::read_chunk::AsyncReadChunk;
>>   use pbs_datastore::{check_backup_owner, DataStore, DatastoreBackend, StoreProgress};
>>   use pbs_tools::bounded_join_set::BoundedJoinSet;
>> +use pbs_tools::buffered_logger::{BufferedLogger, LogLineSender};
>>   use pbs_tools::sha::sha256;
>>   
>>   use super::sync::{
>> @@ -153,6 +154,8 @@ async fn pull_index_chunks<I: IndexFile>(
>>       index: I,
>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>       backend: &DatastoreBackend,
>> +    archive_prefix: &str,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>>       use futures::stream::{self, StreamExt, TryStreamExt};
>>   
>> @@ -247,11 +250,16 @@ async fn pull_index_chunks<I: IndexFile>(
>>       let bytes = bytes.load(Ordering::SeqCst);
>>       let chunk_count = chunk_count.load(Ordering::SeqCst);
>>   
>> -    info!(
>> -        "downloaded {} ({}/s)",
>> -        HumanByte::from(bytes),
>> -        HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
>> -    );
>> +    log_sender
>> +        .log(
>> +            Level::INFO,
>> +            format!(
>> +                "{archive_prefix}: downloaded {} ({}/s)",
>> +                HumanByte::from(bytes),
>> +                HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
>> +            ),
>> +        )
>> +        .await?;
>>   
>>       Ok(SyncStats {
>>           chunk_count,
>> @@ -292,6 +300,7 @@ async fn pull_single_archive<'a>(
>>       archive_info: &'a FileInfo,
>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>       backend: &DatastoreBackend,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>>       let archive_name = &archive_info.filename;
>>       let mut path = snapshot.full_path();
>> @@ -302,72 +311,104 @@ async fn pull_single_archive<'a>(
>>   
>>       let mut sync_stats = SyncStats::default();
>>   
>> -    info!("sync archive {archive_name}");
>> +    let archive_prefix = format!("{}: {archive_name}", snapshot.backup_time_string());
>> +
>> +    log_sender
>> +        .log(Level::INFO, format!("{archive_prefix}: sync archive"))
>> +        .await?;
>>   
>> -    reader.load_file_into(archive_name, &tmp_path).await?;
>> +    reader
>> +        .load_file_into(archive_name, &tmp_path)
>> +        .await
>> +        .with_context(|| archive_prefix.clone())?;
>>   
>> -    let mut tmpfile = std::fs::OpenOptions::new().read(true).open(&tmp_path)?;
>> +    let mut tmpfile = std::fs::OpenOptions::new()
>> +        .read(true)
>> +        .open(&tmp_path)
>> +        .with_context(|| archive_prefix.clone())?;
>>   
>>       match ArchiveType::from_path(archive_name)? {
>>           ArchiveType::DynamicIndex => {
>>               let index = DynamicIndexReader::new(tmpfile).map_err(|err| {
>> -                format_err!("unable to read dynamic index {:?} - {}", tmp_path, err)
>> +                format_err!("{archive_prefix}: unable to read dynamic index {tmp_path:?} - {err}")
>>               })?;
>>               let (csum, size) = index.compute_csum();
>> -            verify_archive(archive_info, &csum, size)?;
>> +            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
>>   
>>               if reader.skip_chunk_sync(snapshot.datastore().name()) {
>> -                info!("skipping chunk sync for same datastore");
>> +                log_sender
>> +                    .log(
>> +                        Level::INFO,
>> +                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
>> +                    )
>> +                    .await?;
>>               } else {
>>                   let stats = pull_index_chunks(
>>                       reader
>>                           .chunk_reader(archive_info.crypt_mode)
>> -                        .context("failed to get chunk reader")?,
>> +                        .context("failed to get chunk reader")
>> +                        .with_context(|| archive_prefix.clone())?,
>>                       snapshot.datastore().clone(),
>>                       index,
>>                       encountered_chunks,
>>                       backend,
>> +                    &archive_prefix,
>> +                    Arc::clone(&log_sender),
>>                   )
>> -                .await?;
>> +                .await
>> +                .with_context(|| archive_prefix.clone())?;
>>                   sync_stats.add(stats);
>>               }
>>           }
>>           ArchiveType::FixedIndex => {
>>               let index = FixedIndexReader::new(tmpfile).map_err(|err| {
>> -                format_err!("unable to read fixed index '{:?}' - {}", tmp_path, err)
>> +                format_err!("{archive_prefix}: unable to read fixed index '{tmp_path:?}' - {err}")
>>               })?;
>>               let (csum, size) = index.compute_csum();
>> -            verify_archive(archive_info, &csum, size)?;
>> +            verify_archive(archive_info, &csum, size).with_context(|| archive_prefix.clone())?;
>>   
>>               if reader.skip_chunk_sync(snapshot.datastore().name()) {
>> -                info!("skipping chunk sync for same datastore");
>> +                log_sender
>> +                    .log(
>> +                        Level::INFO,
>> +                        format!("{archive_prefix}: skipping chunk sync for same datastore"),
>> +                    )
>> +                    .await?;
>>               } else {
>>                   let stats = pull_index_chunks(
>>                       reader
>>                           .chunk_reader(archive_info.crypt_mode)
>> -                        .context("failed to get chunk reader")?,
>> +                        .context("failed to get chunk reader")
>> +                        .with_context(|| archive_prefix.clone())?,
>>                       snapshot.datastore().clone(),
>>                       index,
>>                       encountered_chunks,
>>                       backend,
>> +                    &archive_prefix,
>> +                    Arc::clone(&log_sender),
>>                   )
>> -                .await?;
>> +                .await
>> +                .with_context(|| archive_prefix.clone())?;
>>                   sync_stats.add(stats);
>>               }
>>           }
>>           ArchiveType::Blob => {
>> -            tmpfile.rewind()?;
>> -            let (csum, size) = sha256(&mut tmpfile)?;
>> -            verify_archive(archive_info, &csum, size)?;
>> +            proxmox_lang::try_block!({
>> +                tmpfile.rewind()?;
>> +                let (csum, size) = sha256(&mut tmpfile)?;
>> +                verify_archive(archive_info, &csum, size)
>> +            })
>> +            .with_context(|| archive_prefix.clone())?;
>>           }
>>       }
>>       if let Err(err) = std::fs::rename(&tmp_path, &path) {
>> -        bail!("Atomic rename file {:?} failed - {}", path, err);
>> +        bail!("{archive_prefix}: Atomic rename file {path:?} failed - {err}");
>>       }
>>   
>>       backend
>>           .upload_index_to_backend(snapshot, archive_name)
>> -        .await?;
>> +        .await
>> +        .with_context(|| archive_prefix.clone())?;
>>   
>>       Ok(sync_stats)
>>   }
>> @@ -388,13 +429,24 @@ async fn pull_snapshot<'a>(
>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>       corrupt: bool,
>>       is_new: bool,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>> +    let prefix = snapshot.backup_time_string().to_owned();
>>       if is_new {
>> -        info!("sync snapshot {}", snapshot.dir());
>> +        log_sender
>> +            .log(Level::INFO, format!("{prefix}: start sync"))
>> +            .await?;
>>       } else if corrupt {
>> -        info!("re-sync snapshot {} due to corruption", snapshot.dir());
>> +        log_sender
>> +            .log(
>> +                Level::INFO,
>> +                format!("re-sync snapshot {prefix} due to corruption"),
>> +            )
>> +            .await?;
>>       } else {
>> -        info!("re-sync snapshot {}", snapshot.dir());
>> +        log_sender
>> +            .log(Level::INFO, format!("re-sync snapshot {prefix}"))
>> +            .await?;
>>       }
>>   
>>       let mut sync_stats = SyncStats::default();
>> @@ -409,7 +461,8 @@ async fn pull_snapshot<'a>(
>>       let tmp_manifest_blob;
>>       if let Some(data) = reader
>>           .load_file_into(MANIFEST_BLOB_NAME.as_ref(), &tmp_manifest_name)
>> -        .await?
>> +        .await
>> +        .with_context(|| prefix.clone())?
>>       {
>>           tmp_manifest_blob = data;
>>       } else {
>> @@ -419,28 +472,34 @@ async fn pull_snapshot<'a>(
>>       if manifest_name.exists() && !corrupt {
>>           let manifest_blob = proxmox_lang::try_block!({
>>               let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> -                format_err!("unable to open local manifest {manifest_name:?} - {err}")
>> +                format_err!("{prefix}: unable to open local manifest {manifest_name:?} - {err}")
>>               })?;
>>   
>> -            let manifest_blob = DataBlob::load_from_reader(&mut manifest_file)?;
>> +            let manifest_blob =
>> +                DataBlob::load_from_reader(&mut manifest_file).with_context(|| prefix.clone())?;
>>               Ok(manifest_blob)
>>           })
>>           .map_err(|err: Error| {
>> -            format_err!("unable to read local manifest {manifest_name:?} - {err}")
>> +            format_err!("{prefix}: unable to read local manifest {manifest_name:?} - {err}")
>>           })?;
>>   
>>           if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>>               if !client_log_name.exists() {
>> -                reader.try_download_client_log(&client_log_name).await?;
>> +                reader
>> +                    .try_download_client_log(&client_log_name)
>> +                    .await
>> +                    .with_context(|| prefix.clone())?;
>>               };
>> -            info!("no data changes");
>> +            log_sender
>> +                .log(Level::INFO, format!("{prefix}: no data changes"))
>> +                .await?;
>>               let _ = std::fs::remove_file(&tmp_manifest_name);
>>               return Ok(sync_stats); // nothing changed
>>           }
>>       }
>>   
>>       let manifest_data = tmp_manifest_blob.raw_data().to_vec();
>> -    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
>> +    let manifest = BackupManifest::try_from(tmp_manifest_blob).with_context(|| prefix.clone())?;
>>   
>>       if ignore_not_verified_or_encrypted(
>>           &manifest,
>> @@ -464,35 +523,54 @@ async fn pull_snapshot<'a>(
>>           path.push(&item.filename);
>>   
>>           if !corrupt && path.exists() {
>> -            let filename: BackupArchiveName = item.filename.as_str().try_into()?;
>> +            let filename: BackupArchiveName = item
>> +                .filename
>> +                .as_str()
>> +                .try_into()
>> +                .with_context(|| prefix.clone())?;
>>               match filename.archive_type() {
>>                   ArchiveType::DynamicIndex => {
>> -                    let index = DynamicIndexReader::open(&path)?;
>> +                    let index = DynamicIndexReader::open(&path).with_context(|| prefix.clone())?;
>>                       let (csum, size) = index.compute_csum();
>>                       match manifest.verify_file(&filename, &csum, size) {
>>                           Ok(_) => continue,
>>                           Err(err) => {
>> -                            info!("detected changed file {path:?} - {err}");
>> +                            log_sender
>> +                                .log(
>> +                                    Level::INFO,
>> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
>> +                                )
>> +                                .await?;
>>                           }
>>                       }
>>                   }
>>                   ArchiveType::FixedIndex => {
>> -                    let index = FixedIndexReader::open(&path)?;
>> +                    let index = FixedIndexReader::open(&path).with_context(|| prefix.clone())?;
>>                       let (csum, size) = index.compute_csum();
>>                       match manifest.verify_file(&filename, &csum, size) {
>>                           Ok(_) => continue,
>>                           Err(err) => {
>> -                            info!("detected changed file {path:?} - {err}");
>> +                            log_sender
>> +                                .log(
>> +                                    Level::INFO,
>> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
>> +                                )
>> +                                .await?;
>>                           }
>>                       }
>>                   }
>>                   ArchiveType::Blob => {
>> -                    let mut tmpfile = std::fs::File::open(&path)?;
>> -                    let (csum, size) = sha256(&mut tmpfile)?;
>> +                    let mut tmpfile = std::fs::File::open(&path).with_context(|| prefix.clone())?;
>> +                    let (csum, size) = sha256(&mut tmpfile).with_context(|| prefix.clone())?;
>>                       match manifest.verify_file(&filename, &csum, size) {
>>                           Ok(_) => continue,
>>                           Err(err) => {
>> -                            info!("detected changed file {path:?} - {err}");
>> +                            log_sender
>> +                                .log(
>> +                                    Level::INFO,
>> +                                    format!("{prefix}: detected changed file {path:?} - {err}"),
>> +                                )
>> +                                .await?;
>>                           }
>>                       }
>>                   }
>> @@ -505,13 +583,14 @@ async fn pull_snapshot<'a>(
>>               item,
>>               encountered_chunks.clone(),
>>               backend,
>> +            Arc::clone(&log_sender),
>>           )
>>           .await?;
>>           sync_stats.add(stats);
>>       }
>>   
>>       if let Err(err) = std::fs::rename(&tmp_manifest_name, &manifest_name) {
>> -        bail!("Atomic rename file {:?} failed - {}", manifest_name, err);
>> +        bail!("{prefix}: Atomic rename file {manifest_name:?} failed - {err}");
>>       }
>>       if let DatastoreBackend::S3(s3_client) = backend {
>>           let object_key = pbs_datastore::s3::object_key_from_path(
>> @@ -524,33 +603,40 @@ async fn pull_snapshot<'a>(
>>           let _is_duplicate = s3_client
>>               .upload_replace_with_retry(object_key, data)
>>               .await
>> -            .context("failed to upload manifest to s3 backend")?;
>> +            .context("failed to upload manifest to s3 backend")
>> +            .with_context(|| prefix.clone())?;
>>       }
>>   
>>       if !client_log_name.exists() {
>> -        reader.try_download_client_log(&client_log_name).await?;
>> +        reader
>> +            .try_download_client_log(&client_log_name)
>> +            .await
>> +            .with_context(|| prefix.clone())?;
>>           if client_log_name.exists() {
>>               if let DatastoreBackend::S3(s3_client) = backend {
>>                   let object_key = pbs_datastore::s3::object_key_from_path(
>>                       &snapshot.relative_path(),
>>                       CLIENT_LOG_BLOB_NAME.as_ref(),
>>                   )
>> -                .context("invalid archive object key")?;
>> +                .context("invalid archive object key")
>> +                .with_context(|| prefix.clone())?;
>>   
>>                   let data = tokio::fs::read(&client_log_name)
>>                       .await
>> -                    .context("failed to read log file contents")?;
>> +                    .context("failed to read log file contents")
>> +                    .with_context(|| prefix.clone())?;
>>                   let contents = hyper::body::Bytes::from(data);
>>                   let _is_duplicate = s3_client
>>                       .upload_replace_with_retry(object_key, contents)
>>                       .await
>> -                    .context("failed to upload client log to s3 backend")?;
>> +                    .context("failed to upload client log to s3 backend")
>> +                    .with_context(|| prefix.clone())?;
>>               }
>>           }
>>       };
>>       snapshot
>>           .cleanup_unreferenced_files(&manifest)
>> -        .map_err(|err| format_err!("failed to cleanup unreferenced files - {err}"))?;
>> +        .map_err(|err| format_err!("{prefix}: failed to cleanup unreferenced files - {err}"))?;
>>   
>>       Ok(sync_stats)
>>   }
>> @@ -565,10 +651,14 @@ async fn pull_snapshot_from<'a>(
>>       snapshot: &'a pbs_datastore::BackupDir,
>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>       corrupt: bool,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>> +    let prefix = snapshot.backup_time_string().to_owned();
>> +
>>       let (_path, is_new, _snap_lock) = snapshot
>>           .datastore()
>> -        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())?;
>> +        .create_locked_backup_dir(snapshot.backup_ns(), snapshot.as_ref())
>> +        .context(prefix.clone())?;
>>   
>>       let result = pull_snapshot(
>>           params,
>> @@ -577,6 +667,7 @@ async fn pull_snapshot_from<'a>(
>>           encountered_chunks,
>>           corrupt,
>>           is_new,
>> +        Arc::clone(&log_sender),
>>       )
>>       .await;
>>   
>> @@ -589,11 +680,20 @@ async fn pull_snapshot_from<'a>(
>>                       snapshot.as_ref(),
>>                       true,
>>                   ) {
>> -                    info!("cleanup error - {cleanup_err}");
>> +                    log_sender
>> +                        .log(
>> +                            Level::INFO,
>> +                            format!("{prefix}: cleanup error - {cleanup_err}"),
>> +                        )
>> +                        .await?;
>>                   }
>>                   return Err(err);
>>               }
>> -            Ok(_) => info!("sync snapshot {} done", snapshot.dir()),
>> +            Ok(_) => {
>> +                log_sender
>> +                    .log(Level::INFO, format!("{prefix}: sync done"))
>> +                    .await?
>> +            }
>>           }
>>       }
>>   
>> @@ -622,7 +722,9 @@ async fn pull_group(
>>       source_namespace: &BackupNamespace,
>>       group: &BackupGroup,
>>       shared_group_progress: Arc<SharedGroupProgress>,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>> +    let prefix = format!("{group}");
>>       let mut already_synced_skip_info = SkipInfo::new(SkipReason::AlreadySynced);
>>       let mut transfer_last_skip_info = SkipInfo::new(SkipReason::TransferLast);
>>   
>> @@ -714,11 +816,15 @@ async fn pull_group(
>>           .collect();
>>   
>>       if already_synced_skip_info.count > 0 {
>> -        info!("{already_synced_skip_info}");
>> +        log_sender
>> +            .log(Level::INFO, format!("{prefix}: {already_synced_skip_info}"))
>> +            .await?;
>>           already_synced_skip_info.reset();
>>       }
>>       if transfer_last_skip_info.count > 0 {
>> -        info!("{transfer_last_skip_info}");
>> +        log_sender
>> +            .log(Level::INFO, format!("{prefix}: {transfer_last_skip_info}"))
>> +            .await?;
>>           transfer_last_skip_info.reset();
>>       }
>>   
>> @@ -730,8 +836,8 @@ async fn pull_group(
>>           .store
>>           .backup_group(target_ns.clone(), group.clone());
>>       if let Some(info) = backup_group.last_backup(true).unwrap_or(None) {
>> -        let mut reusable_chunks = encountered_chunks.lock().unwrap();
>>           if let Err(err) = proxmox_lang::try_block!({
>> +            let mut reusable_chunks = encountered_chunks.lock().unwrap();
>>               let _snapshot_guard = info
>>                   .backup_dir
>>                   .lock_shared()
>> @@ -780,7 +886,12 @@ async fn pull_group(
>>               }
>>               Ok::<(), Error>(())
>>           }) {
>> -            warn!("Failed to collect reusable chunk from last backup: {err:#?}");
>> +            log_sender
>> +                .log(
>> +                    Level::WARN,
>> +                    format!("Failed to collect reusable chunk from last backup: {err:#?}"),
>> +                )
>> +                .await?;
>>           }
>>       }
>>   
>> @@ -805,13 +916,16 @@ async fn pull_group(
>>               &to_snapshot,
>>               encountered_chunks.clone(),
>>               corrupt,
>> +            Arc::clone(&log_sender),
>>           )
>>           .await;
>>   
>>           // Update done groups progress by other parallel running pulls
>>           local_progress.done_groups = shared_group_progress.load_done();
>>           local_progress.done_snapshots = pos as u64 + 1;
>> -        info!("percentage done: group {group}: {local_progress}");
>> +        log_sender
>> +            .log(Level::INFO, format!("percentage done: {local_progress}"))
>> +            .await?;
>>   
>>           let stats = result?; // stop on error
>>           sync_stats.add(stats);
>> @@ -829,13 +943,23 @@ async fn pull_group(
>>                   continue;
>>               }
>>               if snapshot.is_protected() {
>> -                info!(
>> -                    "don't delete vanished snapshot {} (protected)",
>> -                    snapshot.dir()
>> -                );
>> +                log_sender
>> +                    .log(
>> +                        Level::INFO,
>> +                        format!(
>> +                            "{prefix}: don't delete vanished snapshot {} (protected)",
>> +                            snapshot.dir(),
>> +                        ),
>> +                    )
>> +                    .await?;
>>                   continue;
>>               }
>> -            info!("delete vanished snapshot {}", snapshot.dir());
>> +            log_sender
>> +                .log(
>> +                    Level::INFO,
>> +                    format!("delete vanished snapshot {}", snapshot.dir()),
>> +                )
>> +                .await?;
>>               params
>>                   .target
>>                   .store
>> @@ -1035,10 +1159,7 @@ pub(crate) async fn pull_store(mut params: PullParameters) -> Result<SyncStats,
>>               }
>>               Err(err) => {
>>                   errors = true;
>> -                info!(
>> -                    "Encountered errors while syncing namespace {} - {err}",
>> -                    &namespace,
>> -                );
>> +                info!("Encountered errors while syncing namespace {namespace} - {err}");
>>               }
>>           };
>>       }
>> @@ -1064,6 +1185,7 @@ async fn lock_and_pull_group(
>>       namespace: &BackupNamespace,
>>       target_namespace: &BackupNamespace,
>>       shared_group_progress: Arc<SharedGroupProgress>,
>> +    log_sender: Arc<LogLineSender>,
>>   ) -> Result<SyncStats, Error> {
>>       let (owner, _lock_guard) =
>>           match params
>> @@ -1073,25 +1195,47 @@ async fn lock_and_pull_group(
>>           {
>>               Ok(res) => res,
>>               Err(err) => {
>> -                info!("sync group {group} failed - group lock failed: {err}");
>> -                info!("create_locked_backup_group failed");
>> +                log_sender
>> +                    .log(
>> +                        Level::INFO,
>> +                        format!("sync group {group} failed - group lock failed: {err}"),
>> +                    )
>> +                    .await?;
>> +                log_sender
>> +                    .log(Level::INFO, "create_locked_backup_group failed".to_string())
>> +                    .await?;
>>                   return Err(err);
>>               }
>>           };
>>   
>>       if params.owner != owner {
>>           // only the owner is allowed to create additional snapshots
>> -        info!(
>> -            "sync group {group} failed - owner check failed ({} != {owner})",
>> -            params.owner
>> -        );
>> +        log_sender
>> +            .log(
>> +                Level::INFO,
>> +                format!(
>> +                    "sync group {group} failed - owner check failed ({} != {owner})",
>> +                    params.owner,
>> +                ),
>> +            )
>> +            .await?;
>>           return Err(format_err!("owner check failed"));
>>       }
>>   
>> -    match pull_group(params, namespace, group, shared_group_progress).await {
>> +    match pull_group(
>> +        params,
>> +        namespace,
>> +        group,
>> +        shared_group_progress,
>> +        Arc::clone(&log_sender),
>> +    )
>> +    .await
>> +    {
>>           Ok(stats) => Ok(stats),
>>           Err(err) => {
>> -            info!("sync group {group} failed - {err:#}");
>> +            log_sender
>> +                .log(Level::INFO, format!("sync group {group} failed - {err:#}"))
>> +                .await?;
>>               Err(err)
>>           }
>>       }
>> @@ -1124,7 +1268,7 @@ async fn pull_ns(
>>       list.sort_unstable();
>>   
>>       info!(
>> -        "found {} groups to sync (out of {unfiltered_count} total)",
>> +        "Found {} groups to sync (out of {unfiltered_count} total)",
>>           list.len()
>>       );
>>   
>> @@ -1143,6 +1287,10 @@ async fn pull_ns(
>>       let shared_group_progress = Arc::new(SharedGroupProgress::with_total_groups(list.len()));
>>       let mut group_workers = BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
>>   
>> +    let (buffered_logger, sender_builder) = BufferedLogger::new(5, Duration::from_secs(1));
>> +    // runs until sender_builder and all senders built from it have been dropped
>> +    buffered_logger.run_log_collection();
>> +
>>       let mut process_results = |results| {
>>           for result in results {
>>               match result {
>> @@ -1160,16 +1308,20 @@ async fn pull_ns(
>>           let target_ns = target_ns.clone();
>>           let params = Arc::clone(&params);
>>           let group_progress_cloned = Arc::clone(&shared_group_progress);
>> +        let log_sender = Arc::new(sender_builder.sender_with_label(group.to_string()));
>>           let results = group_workers
>>               .spawn_task(async move {
>> -                lock_and_pull_group(
>> +                let result = lock_and_pull_group(
>>                       Arc::clone(&params),
>>                       &group,
>>                       &namespace,
>>                       &target_ns,
>>                       group_progress_cloned,
>> +                    Arc::clone(&log_sender),
>>                   )
>> -                .await
>> +                .await;
>> +                let _ = log_sender.flush().await;
>> +                result
>>               })
>>               .await
>>               .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
>> @@ -1197,7 +1349,7 @@ async fn pull_ns(
>>                   if !local_group.apply_filters(&params.group_filter) {
>>                       continue;
>>                   }
>> -                info!("delete vanished group '{local_group}'");
>> +                info!("Delete vanished group '{local_group}'");
>>                   let delete_stats_result = params
>>                       .target
>>                       .store
>> @@ -1206,7 +1358,7 @@ async fn pull_ns(
>>                   match delete_stats_result {
>>                       Ok(stats) => {
>>                           if !stats.all_removed() {
>> -                            info!("kept some protected snapshots of group '{local_group}'");
>> +                            info!("Kept some protected snapshots of group '{local_group}'");
>>                               sync_stats.add(SyncStats::from(RemovedVanishedStats {
>>                                   snapshots: stats.removed_snapshots(),
>>                                   groups: 0,
>> @@ -1229,7 +1381,7 @@ async fn pull_ns(
>>               Ok(())
>>           });
>>           if let Err(err) = result {
>> -            info!("error during cleanup: {err}");
>> +            info!("Error during cleanup: {err}");
>>               errors = true;
>>           };
>>       }
>> diff --git a/src/server/sync.rs b/src/server/sync.rs
>> index e88418442..17ed4839f 100644
>> --- a/src/server/sync.rs
>> +++ b/src/server/sync.rs
>> @@ -13,7 +13,6 @@ use futures::{future::FutureExt, select};
>>   use hyper::http::StatusCode;
>>   use pbs_config::BackupLockGuard;
>>   use serde_json::json;
>> -use tokio::task::JoinSet;
>>   use tracing::{info, warn};
>>   
>>   use proxmox_human_byte::HumanByte;
>> @@ -136,13 +135,13 @@ impl SyncSourceReader for RemoteSourceReader {
>>                   Some(HttpError { code, message }) => match *code {
>>                       StatusCode::NOT_FOUND => {
>>                           info!(
>> -                            "skipping snapshot {} - vanished since start of sync",
>> +                            "Snapshot {}: skipped because it vanished since start of sync",
>>                               &self.dir
>>                           );
>>                           return Ok(None);
>>                       }
>>                       _ => {
>> -                        bail!("HTTP error {code} - {message}");
>> +                        bail!("Snapshot {}: HTTP error {code} - {message}", &self.dir);
>>                       }
>>                   },
>>                   None => {
>> @@ -176,7 +175,8 @@ impl SyncSourceReader for RemoteSourceReader {
>>                   bail!("Atomic rename file {to_path:?} failed - {err}");
>>               }
>>               info!(
>> -                "got backup log file {client_log_name}",
>> +                "Snapshot {snapshot}: got backup log file {client_log_name}",
>> +                snapshot = &self.dir,
>>                   client_log_name = client_log_name.deref()
>>               );
>>           }
>> -- 
>> 2.47.3
>>
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-21  7:21     ` Christian Ebner
@ 2026-04-21  7:42       ` Christian Ebner
  2026-04-21  8:00         ` Fabian Grünbichler
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Ebner @ 2026-04-21  7:42 UTC (permalink / raw)
  To: pbs-devel

On 4/21/26 9:21 AM, Christian Ebner wrote:
> On 4/20/26 1:54 PM, Fabian Grünbichler wrote:
>> On April 17, 2026 11:26 am, Christian Ebner wrote:
>>> Pulling groups and therefore also snapshots in parallel leads to
>>> unordered log outputs, making it mostly impossible to relate a log
>>> message to a backup snapshot/group.
>>>
>>> Therefore, prefix pull job log messages by the corresponding group or
>>> snapshot and set the error context accordingly.
>>>
>>> Also, reword some messages, inline variables in format strings and
>>> start log lines with capital letters to get consistent output.
>>>
>>> By using the buffered logger implementation and buffer up to 5 lines
>>> with a timeout of 1 second, subsequent log lines arriving in fast
>>> succession are kept together, reducing the mixing of lines.
>>>
>>> Example output for a sequential pull job:
>>> ```
>>> ...
>>> [ct/100]: 2025-11-17T10:11:42Z: start sync
>>> [ct/100]: 2025-11-17T10:11:42Z: pct.conf.blob: sync archive
>>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: sync archive
>>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: downloaded 16.785 MiB (280.791 MiB/s)
>>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: sync archive
>>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: downloaded 65.703 KiB (29.1 MiB/s)
>>> [ct/100]: 2025-11-17T10:11:42Z: sync done
>>> [ct/100]: percentage done: 9.09% (1/11 groups)
>>> [ct/101]: 2026-03-31T12:20:16Z: start sync
>>> [ct/101]: 2026-03-31T12:20:16Z: pct.conf.blob: sync archive
>>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: sync archive
>>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: downloaded 199.806 MiB (311.91 MiB/s)
>>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: sync archive
>>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: downloaded 180.379 KiB (22.748 MiB/s)
>>> [ct/101]: 2026-03-31T12:20:16Z: sync done
>>
>> this is probably the wrong patch for this comment, but since you have
>> the sample output here ;)
>>
>> should we pad the labels? the buffered logger knows all currently active
>> labels and could adapt it? otherwise especially for host backups the
>> logs are again not really scannable by humans, because there will be
>> weird jumps in alignment.. or (.. continued below)
> 
> That will not work though? While the logger knows about the labels when 
> receiving them, it does not know at all about future ones that might 
> still come in. So it can happen that the padding does not fit anymore, 
> and one gets even uglier formatting issues.
> 
> I would prefer to keep them as-is for the time being. For vm/ct it is 
> not that bad; one could maybe foresee the ID to be a 4-digit number and 
> define a minimum label length to pad to if not reached.
> 
> Host backups or backups with explicit label
> 
> 
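Not part of the series — just a hypothetical sketch of the "minimum label length" idea from above, with `MIN_LABEL_WIDTH` and `padded_label` being made-up names:

```rust
// Hypothetical sketch: pad each label to a fixed minimum width so lines
// from short labels such as "ct/100" stay column-aligned, without the
// logger needing to know all labels up front. Longer labels (e.g. host
// backups) are left untouched, which is where alignment can still break.
const MIN_LABEL_WIDTH: usize = 12;

fn padded_label(label: &str) -> String {
    // `{:<width$}` left-aligns and space-pads up to MIN_LABEL_WIDTH
    format!("[{:<width$}]", label, width = MIN_LABEL_WIDTH)
}

fn main() {
    println!("{} percentage done: 9.09%", padded_label("ct/100"));
    println!("{} start sync", padded_label("host/exclusion-test"));
}
```

This only helps for labels shorter than the chosen width; explicitly named host backups would still jump around, as noted above.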
>>> ...
>>> ```
>>>
>>> Example output for a parallel pull job:
>>> ```
>>> ...
>>> [ct/107]: 2025-07-16T09:14:01Z: start sync
>>> [ct/107]: 2025-07-16T09:14:01Z: pct.conf.blob: sync archive
>>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: sync archive
>>> [vm/108]: 2025-09-19T07:37:19Z: start sync
>>> [vm/108]: 2025-09-19T07:37:19Z: qemu-server.conf.blob: sync archive
>>> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: sync archive
>>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: downloaded 609.233 
>>> MiB (112.628 MiB/s)
>>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: sync archive
>>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: downloaded 1.172 MiB 
>>> (17.838 MiB/s)
>>> [ct/107]: 2025-07-16T09:14:01Z: sync done
>>> [ct/107]: percentage done: 72.73% (8/11 groups)
>>
>> the way the prefix and snapshot are formatted could also be interpreted
>> at first glance as a timestamp of the log line.. why not just prepend
>> the prefix on the logger side, and leave it up to the caller to do the
>> formatting? then we could use "{type}/{id}/" as prefix here? or add
>> 'snapshot' in those lines to make it clear? granted, this is more of an
>> issue when viewing the log via `proxmox-backup-manager`, as in the UI we
>> have the log timestamps up front..
> 
> I simply followed your suggestion here. In prior versions I had exactly 
> that, but you rejected it as too verbose? :)
> 
> https://lore.proxmox.com/pbs-devel/1774263381.bngcrer2th.astroid@yuna.none/
> 
>>
>> and maybe(?) log the progress line using a different prefix? because
>> right now the information that the group [ct/107] is finished is not
>> really clear from the output, IMHO.
> 
> That I can add, yes.
> 
>>
>> the progress logging is also still broken (this is for a sync that takes
>> a while, this is not log messages being buffered and re-ordered!):
>>
>> $ proxmox-backup-manager task log 
>> 'UPID:yuna:00070656:001E420F:00000002:69E610EB:syncjob:local\x3atest\x3atank\x3a\x3as\x2dbc01cba6\x2d805a:root@pam:' | grep -e namespace -e 'percentage done'
>> Syncing datastore 'test', root namespace into datastore 'tank', root 
>> namespace
>> Finished syncing root namespace, current progress: 0 groups, 0 snapshots
>> Syncing datastore 'test', namespace 'test' into datastore 'tank', 
>> namespace 'test'
>> [host/exclusion-test]: percentage done: 5.26% (1/19 groups)
>> [host/acltest]: percentage done: 5.26% (1/19 groups)
>> [host/logtest]: percentage done: 5.26% (1/19 groups)
>> [host/onemeg]: percentage done: 5.26% (1/19 groups)
>> [host/fourmeg]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in 
>> group #1)
>> [host/symlink]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in 
>> group #1)
>> [host/symlink]: percentage done: 5.26% (1/19 groups)
>> [host/fourmeg]: percentage done: 5.26% (1/19 groups)
>> [host/format-v2-test]: percentage done: 1.75% (0/19 groups, 1/3 
>> snapshots in group #1)
>> [host/format-v2-test]: percentage done: 3.51% (0/19 groups, 2/3 
>> snapshots in group #1)
>> [host/format-v2-test]: percentage done: 5.26% (1/19 groups)
>> [host/incrementaltest2]: percentage done: 2.63% (0/19 groups, 1/2 
>> snapshots in group #1)
>> [host/incrementaltest2]: percentage done: 5.26% (1/19 groups)
> 
> There will always be cases where it does not work, I guess; not sure 
> what I can do to make you happy here.

Oh, sorry, might have been too quick on this one. You are saying this 
happens for a non-parallel sync?

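For anyone following along: the per-worker buffering added in this version behaves roughly like the following simplified sketch (not the actual BufferedLogger code, which runs as a tokio task): up to `cap` lines are held back and then emitted as one contiguous block, either when the buffer fills or when a timeout has elapsed since the first buffered line.

```rust
use std::time::{Duration, Instant};

/// Simplified sketch of the buffering idea: collect up to `cap` lines and
/// release them as one batch when the buffer is full or `timeout` has
/// elapsed, so lines from one group worker stay together in the task log.
struct LineBuffer {
    cap: usize,
    timeout: Duration,
    lines: Vec<String>,
    first_push: Option<Instant>,
}

impl LineBuffer {
    fn new(cap: usize, timeout: Duration) -> Self {
        Self { cap, timeout, lines: Vec::new(), first_push: None }
    }

    /// Returns a batch of lines to emit together once the buffer is ready.
    fn push(&mut self, line: String) -> Option<Vec<String>> {
        if self.lines.is_empty() {
            self.first_push = Some(Instant::now());
        }
        self.lines.push(line);
        let timed_out = self
            .first_push
            .map(|start| start.elapsed() >= self.timeout)
            .unwrap_or(false);
        if self.lines.len() >= self.cap || timed_out {
            self.first_push = None;
            Some(std::mem::take(&mut self.lines))
        } else {
            None
        }
    }
}

fn main() {
    let mut buf = LineBuffer::new(3, Duration::from_secs(1));
    assert!(buf.push("[ct/101]: start sync".into()).is_none());
    assert!(buf.push("[ct/101]: pct.conf.blob: sync archive".into()).is_none());
    let batch = buf.push("[ct/101]: sync done".into()).expect("full buffer flushes");
    assert_eq!(batch.len(), 3);
    println!("flushed {} lines as one block", batch.len());
}
```

The real implementation additionally needs an explicit flush when a worker finishes (as the `log_sender.flush().await` in the patch does), otherwise a partially filled buffer would only drain on the next timeout.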
>>
>>> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: downloaded 
>>> 1.196 GiB (156.892 MiB/s)
>>> [vm/108]: 2025-09-19T07:37:19Z: sync done
>>> ...
>>>
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> changes since version 5:
>>> - uses BufferedLogger implementation, refactored accordingly
>>> - improve log line prefixes
>>> - add missing error contexts
>>>
>>>   src/server/pull.rs | 314 +++++++++++++++++++++++++++++++++------------
>>>   src/server/sync.rs |   8 +-
>>>   2 files changed, 237 insertions(+), 85 deletions(-)
>>>
>>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>>> index 611441d2a..f7aae4d59 100644
>>> --- a/src/server/pull.rs
>>> +++ b/src/server/pull.rs
>>> @@ -5,11 +5,11 @@ use std::collections::{HashMap, HashSet};
>>>   use std::io::Seek;
>>>   use std::sync::atomic::{AtomicUsize, Ordering};
>>>   use std::sync::{Arc, Mutex};
>>> -use std::time::SystemTime;
>>> +use std::time::{Duration, SystemTime};
>>>   use anyhow::{bail, format_err, Context, Error};
>>>   use proxmox_human_byte::HumanByte;
>>> -use tracing::{info, warn};
>>> +use tracing::{info, Level};
>>>   use pbs_api_types::{
>>>       print_store_and_ns, ArchiveType, Authid, BackupArchiveName, 
>>> BackupDir, BackupGroup,
>>> @@ -27,6 +27,7 @@ use pbs_datastore::manifest::{BackupManifest, 
>>> FileInfo};
>>>   use pbs_datastore::read_chunk::AsyncReadChunk;
>>>   use pbs_datastore::{check_backup_owner, DataStore, 
>>> DatastoreBackend, StoreProgress};
>>>   use pbs_tools::bounded_join_set::BoundedJoinSet;
>>> +use pbs_tools::buffered_logger::{BufferedLogger, LogLineSender};
>>>   use pbs_tools::sha::sha256;
>>>   use super::sync::{
>>> @@ -153,6 +154,8 @@ async fn pull_index_chunks<I: IndexFile>(
>>>       index: I,
>>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>>       backend: &DatastoreBackend,
>>> +    archive_prefix: &str,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>>       use futures::stream::{self, StreamExt, TryStreamExt};
>>> @@ -247,11 +250,16 @@ async fn pull_index_chunks<I: IndexFile>(
>>>       let bytes = bytes.load(Ordering::SeqCst);
>>>       let chunk_count = chunk_count.load(Ordering::SeqCst);
>>> -    info!(
>>> -        "downloaded {} ({}/s)",
>>> -        HumanByte::from(bytes),
>>> -        HumanByte::new_binary(bytes as f64 / elapsed.as_secs_f64()),
>>> -    );
>>> +    log_sender
>>> +        .log(
>>> +            Level::INFO,
>>> +            format!(
>>> +                "{archive_prefix}: downloaded {} ({}/s)",
>>> +                HumanByte::from(bytes),
>>> +                HumanByte::new_binary(bytes as f64 / 
>>> elapsed.as_secs_f64()),
>>> +            ),
>>> +        )
>>> +        .await?;
>>>       Ok(SyncStats {
>>>           chunk_count,
>>> @@ -292,6 +300,7 @@ async fn pull_single_archive<'a>(
>>>       archive_info: &'a FileInfo,
>>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>>       backend: &DatastoreBackend,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>>       let archive_name = &archive_info.filename;
>>>       let mut path = snapshot.full_path();
>>> @@ -302,72 +311,104 @@ async fn pull_single_archive<'a>(
>>>       let mut sync_stats = SyncStats::default();
>>> -    info!("sync archive {archive_name}");
>>> +    let archive_prefix = format!("{}: {archive_name}", 
>>> snapshot.backup_time_string());
>>> +
>>> +    log_sender
>>> +        .log(Level::INFO, format!("{archive_prefix}: sync archive"))
>>> +        .await?;
>>> -    reader.load_file_into(archive_name, &tmp_path).await?;
>>> +    reader
>>> +        .load_file_into(archive_name, &tmp_path)
>>> +        .await
>>> +        .with_context(|| archive_prefix.clone())?;
>>> -    let mut tmpfile = 
>>> std::fs::OpenOptions::new().read(true).open(&tmp_path)?;
>>> +    let mut tmpfile = std::fs::OpenOptions::new()
>>> +        .read(true)
>>> +        .open(&tmp_path)
>>> +        .with_context(|| archive_prefix.clone())?;
>>>       match ArchiveType::from_path(archive_name)? {
>>>           ArchiveType::DynamicIndex => {
>>>               let index = DynamicIndexReader::new(tmpfile).map_err(| 
>>> err| {
>>> -                format_err!("unable to read dynamic index {:?} - 
>>> {}", tmp_path, err)
>>> +                format_err!("{archive_prefix}: unable to read 
>>> dynamic index {tmp_path:?} - {err}")
>>>               })?;
>>>               let (csum, size) = index.compute_csum();
>>> -            verify_archive(archive_info, &csum, size)?;
>>> +            verify_archive(archive_info, &csum, 
>>> size).with_context(|| archive_prefix.clone())?;
>>>               if reader.skip_chunk_sync(snapshot.datastore().name()) {
>>> -                info!("skipping chunk sync for same datastore");
>>> +                log_sender
>>> +                    .log(
>>> +                        Level::INFO,
>>> +                        format!("{archive_prefix}: skipping chunk 
>>> sync for same datastore"),
>>> +                    )
>>> +                    .await?;
>>>               } else {
>>>                   let stats = pull_index_chunks(
>>>                       reader
>>>                           .chunk_reader(archive_info.crypt_mode)
>>> -                        .context("failed to get chunk reader")?,
>>> +                        .context("failed to get chunk reader")
>>> +                        .with_context(|| archive_prefix.clone())?,
>>>                       snapshot.datastore().clone(),
>>>                       index,
>>>                       encountered_chunks,
>>>                       backend,
>>> +                    &archive_prefix,
>>> +                    Arc::clone(&log_sender),
>>>                   )
>>> -                .await?;
>>> +                .await
>>> +                .with_context(|| archive_prefix.clone())?;
>>>                   sync_stats.add(stats);
>>>               }
>>>           }
>>>           ArchiveType::FixedIndex => {
>>>               let index = FixedIndexReader::new(tmpfile).map_err(|err| {
>>> -                format_err!("unable to read fixed index '{:?}' - 
>>> {}", tmp_path, err)
>>> +                format_err!("{archive_name}: unable to read fixed 
>>> index '{tmp_path:?}' - {err}")
>>>               })?;
>>>               let (csum, size) = index.compute_csum();
>>> -            verify_archive(archive_info, &csum, size)?;
>>> +            verify_archive(archive_info, &csum, 
>>> size).with_context(|| archive_prefix.clone())?;
>>>               if reader.skip_chunk_sync(snapshot.datastore().name()) {
>>> -                info!("skipping chunk sync for same datastore");
>>> +                log_sender
>>> +                    .log(
>>> +                        Level::INFO,
>>> +                        format!("{archive_prefix}: skipping chunk 
>>> sync for same datastore"),
>>> +                    )
>>> +                    .await?;
>>>               } else {
>>>                   let stats = pull_index_chunks(
>>>                       reader
>>>                           .chunk_reader(archive_info.crypt_mode)
>>> -                        .context("failed to get chunk reader")?,
>>> +                        .context("failed to get chunk reader")
>>> +                        .with_context(|| archive_prefix.clone())?,
>>>                       snapshot.datastore().clone(),
>>>                       index,
>>>                       encountered_chunks,
>>>                       backend,
>>> +                    &archive_prefix,
>>> +                    Arc::clone(&log_sender),
>>>                   )
>>> -                .await?;
>>> +                .await
>>> +                .with_context(|| archive_prefix.clone())?;
>>>                   sync_stats.add(stats);
>>>               }
>>>           }
>>>           ArchiveType::Blob => {
>>> -            tmpfile.rewind()?;
>>> -            let (csum, size) = sha256(&mut tmpfile)?;
>>> -            verify_archive(archive_info, &csum, size)?;
>>> +            proxmox_lang::try_block!({
>>> +                tmpfile.rewind()?;
>>> +                let (csum, size) = sha256(&mut tmpfile)?;
>>> +                verify_archive(archive_info, &csum, size)
>>> +            })
>>> +            .with_context(|| archive_prefix.clone())?;
>>>           }
>>>       }
>>>       if let Err(err) = std::fs::rename(&tmp_path, &path) {
>>> -        bail!("Atomic rename file {:?} failed - {}", path, err);
>>> +        bail!("{archive_prefix}: Atomic rename file {path:?} failed 
>>> - {err}");
>>>       }
>>>       backend
>>>           .upload_index_to_backend(snapshot, archive_name)
>>> -        .await?;
>>> +        .await
>>> +        .with_context(|| archive_prefix.clone())?;
>>>       Ok(sync_stats)
>>>   }
>>> @@ -388,13 +429,24 @@ async fn pull_snapshot<'a>(
>>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>>       corrupt: bool,
>>>       is_new: bool,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>> +    let prefix = snapshot.backup_time_string().to_owned();
>>>       if is_new {
>>> -        info!("sync snapshot {}", snapshot.dir());
>>> +        log_sender
>>> +            .log(Level::INFO, format!("{prefix}: start sync"))
>>> +            .await?;
>>>       } else if corrupt {
>>> -        info!("re-sync snapshot {} due to corruption", snapshot.dir());
>>> +        log_sender
>>> +            .log(
>>> +                Level::INFO,
>>> +                format!("re-sync snapshot {prefix} due to corruption"),
>>> +            )
>>> +            .await?;
>>>       } else {
>>> -        info!("re-sync snapshot {}", snapshot.dir());
>>> +        log_sender
>>> +            .log(Level::INFO, format!("re-sync snapshot {prefix}"))
>>> +            .await?;
>>>       }
>>>       let mut sync_stats = SyncStats::default();
>>> @@ -409,7 +461,8 @@ async fn pull_snapshot<'a>(
>>>       let tmp_manifest_blob;
>>>       if let Some(data) = reader
>>>           .load_file_into(MANIFEST_BLOB_NAME.as_ref(), 
>>> &tmp_manifest_name)
>>> -        .await?
>>> +        .await
>>> +        .with_context(|| prefix.clone())?
>>>       {
>>>           tmp_manifest_blob = data;
>>>       } else {
>>> @@ -419,28 +472,34 @@ async fn pull_snapshot<'a>(
>>>       if manifest_name.exists() && !corrupt {
>>>           let manifest_blob = proxmox_lang::try_block!({
>>>               let mut manifest_file = 
>>> std::fs::File::open(&manifest_name).map_err(|err| {
>>> -                format_err!("unable to open local manifest 
>>> {manifest_name:?} - {err}")
>>> +                format_err!("{prefix}: unable to open local manifest 
>>> {manifest_name:?} - {err}")
>>>               })?;
>>> -            let manifest_blob = DataBlob::load_from_reader(&mut 
>>> manifest_file)?;
>>> +            let manifest_blob =
>>> +                DataBlob::load_from_reader(&mut 
>>> manifest_file).with_context(|| prefix.clone())?;
>>>               Ok(manifest_blob)
>>>           })
>>>           .map_err(|err: Error| {
>>> -            format_err!("unable to read local manifest 
>>> {manifest_name:?} - {err}")
>>> +            format_err!("{prefix}: unable to read local manifest 
>>> {manifest_name:?} - {err}")
>>>           })?;
>>>           if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>>>               if !client_log_name.exists() {
>>> -                
>>> reader.try_download_client_log(&client_log_name).await?;
>>> +                reader
>>> +                    .try_download_client_log(&client_log_name)
>>> +                    .await
>>> +                    .with_context(|| prefix.clone())?;
>>>               };
>>> -            info!("no data changes");
>>> +            log_sender
>>> +                .log(Level::INFO, format!("{prefix}: no data changes"))
>>> +                .await?;
>>>               let _ = std::fs::remove_file(&tmp_manifest_name);
>>>               return Ok(sync_stats); // nothing changed
>>>           }
>>>       }
>>>       let manifest_data = tmp_manifest_blob.raw_data().to_vec();
>>> -    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
>>> +    let manifest = 
>>> BackupManifest::try_from(tmp_manifest_blob).with_context(|| 
>>> prefix.clone())?;
>>>       if ignore_not_verified_or_encrypted(
>>>           &manifest,
>>> @@ -464,35 +523,54 @@ async fn pull_snapshot<'a>(
>>>           path.push(&item.filename);
>>>           if !corrupt && path.exists() {
>>> -            let filename: BackupArchiveName = 
>>> item.filename.as_str().try_into()?;
>>> +            let filename: BackupArchiveName = item
>>> +                .filename
>>> +                .as_str()
>>> +                .try_into()
>>> +                .with_context(|| prefix.clone())?;
>>>               match filename.archive_type() {
>>>                   ArchiveType::DynamicIndex => {
>>> -                    let index = DynamicIndexReader::open(&path)?;
>>> +                    let index = 
>>> DynamicIndexReader::open(&path).with_context(|| prefix.clone())?;
>>>                       let (csum, size) = index.compute_csum();
>>>                       match manifest.verify_file(&filename, &csum, 
>>> size) {
>>>                           Ok(_) => continue,
>>>                           Err(err) => {
>>> -                            info!("detected changed file {path:?} - 
>>> {err}");
>>> +                            log_sender
>>> +                                .log(
>>> +                                    Level::INFO,
>>> +                                    format!("{prefix}: detected 
>>> changed file {path:?} - {err}"),
>>> +                                )
>>> +                                .await?;
>>>                           }
>>>                       }
>>>                   }
>>>                   ArchiveType::FixedIndex => {
>>> -                    let index = FixedIndexReader::open(&path)?;
>>> +                    let index = 
>>> FixedIndexReader::open(&path).with_context(|| prefix.clone())?;
>>>                       let (csum, size) = index.compute_csum();
>>>                       match manifest.verify_file(&filename, &csum, 
>>> size) {
>>>                           Ok(_) => continue,
>>>                           Err(err) => {
>>> -                            info!("detected changed file {path:?} - 
>>> {err}");
>>> +                            log_sender
>>> +                                .log(
>>> +                                    Level::INFO,
>>> +                                    format!("{prefix}: detected 
>>> changed file {path:?} - {err}"),
>>> +                                )
>>> +                                .await?;
>>>                           }
>>>                       }
>>>                   }
>>>                   ArchiveType::Blob => {
>>> -                    let mut tmpfile = std::fs::File::open(&path)?;
>>> -                    let (csum, size) = sha256(&mut tmpfile)?;
>>> +                    let mut tmpfile = 
>>> std::fs::File::open(&path).with_context(|| prefix.clone())?;
>>> +                    let (csum, size) = sha256(&mut 
>>> tmpfile).with_context(|| prefix.clone())?;
>>>                       match manifest.verify_file(&filename, &csum, 
>>> size) {
>>>                           Ok(_) => continue,
>>>                           Err(err) => {
>>> -                            info!("detected changed file {path:?} - 
>>> {err}");
>>> +                            log_sender
>>> +                                .log(
>>> +                                    Level::INFO,
>>> +                                    format!("{prefix}: detected 
>>> changed file {path:?} - {err}"),
>>> +                                )
>>> +                                .await?;
>>>                           }
>>>                       }
>>>                   }
>>> @@ -505,13 +583,14 @@ async fn pull_snapshot<'a>(
>>>               item,
>>>               encountered_chunks.clone(),
>>>               backend,
>>> +            Arc::clone(&log_sender),
>>>           )
>>>           .await?;
>>>           sync_stats.add(stats);
>>>       }
>>>       if let Err(err) = std::fs::rename(&tmp_manifest_name, 
>>> &manifest_name) {
>>> -        bail!("Atomic rename file {:?} failed - {}", manifest_name, 
>>> err);
>>> +        bail!("{prefix}: Atomic rename file {manifest_name:?} failed 
>>> - {err}");
>>>       }
>>>       if let DatastoreBackend::S3(s3_client) = backend {
>>>           let object_key = pbs_datastore::s3::object_key_from_path(
>>> @@ -524,33 +603,40 @@ async fn pull_snapshot<'a>(
>>>           let _is_duplicate = s3_client
>>>               .upload_replace_with_retry(object_key, data)
>>>               .await
>>> -            .context("failed to upload manifest to s3 backend")?;
>>> +            .context("failed to upload manifest to s3 backend")
>>> +            .with_context(|| prefix.clone())?;
>>>       }
>>>       if !client_log_name.exists() {
>>> -        reader.try_download_client_log(&client_log_name).await?;
>>> +        reader
>>> +            .try_download_client_log(&client_log_name)
>>> +            .await
>>> +            .with_context(|| prefix.clone())?;
>>>           if client_log_name.exists() {
>>>               if let DatastoreBackend::S3(s3_client) = backend {
>>>                   let object_key = 
>>> pbs_datastore::s3::object_key_from_path(
>>>                       &snapshot.relative_path(),
>>>                       CLIENT_LOG_BLOB_NAME.as_ref(),
>>>                   )
>>> -                .context("invalid archive object key")?;
>>> +                .context("invalid archive object key")
>>> +                .with_context(|| prefix.clone())?;
>>>                   let data = tokio::fs::read(&client_log_name)
>>>                       .await
>>> -                    .context("failed to read log file contents")?;
>>> +                    .context("failed to read log file contents")
>>> +                    .with_context(|| prefix.clone())?;
>>>                   let contents = hyper::body::Bytes::from(data);
>>>                   let _is_duplicate = s3_client
>>>                       .upload_replace_with_retry(object_key, contents)
>>>                       .await
>>> -                    .context("failed to upload client log to s3 
>>> backend")?;
>>> +                    .context("failed to upload client log to s3 
>>> backend")
>>> +                    .with_context(|| prefix.clone())?;
>>>               }
>>>           }
>>>       };
>>>       snapshot
>>>           .cleanup_unreferenced_files(&manifest)
>>> -        .map_err(|err| format_err!("failed to cleanup unreferenced 
>>> files - {err}"))?;
>>> +        .map_err(|err| format_err!("{prefix}: failed to cleanup 
>>> unreferenced files - {err}"))?;
>>>       Ok(sync_stats)
>>>   }
>>> @@ -565,10 +651,14 @@ async fn pull_snapshot_from<'a>(
>>>       snapshot: &'a pbs_datastore::BackupDir,
>>>       encountered_chunks: Arc<Mutex<EncounteredChunks>>,
>>>       corrupt: bool,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>> +    let prefix = format!("{}", snapshot.backup_time_string());
>>> +
>>>       let (_path, is_new, _snap_lock) = snapshot
>>>           .datastore()
>>> -        .create_locked_backup_dir(snapshot.backup_ns(), 
>>> snapshot.as_ref())?;
>>> +        .create_locked_backup_dir(snapshot.backup_ns(), 
>>> snapshot.as_ref())
>>> +        .context(prefix.clone())?;
>>>       let result = pull_snapshot(
>>>           params,
>>> @@ -577,6 +667,7 @@ async fn pull_snapshot_from<'a>(
>>>           encountered_chunks,
>>>           corrupt,
>>>           is_new,
>>> +        Arc::clone(&log_sender),
>>>       )
>>>       .await;
>>> @@ -589,11 +680,20 @@ async fn pull_snapshot_from<'a>(
>>>                       snapshot.as_ref(),
>>>                       true,
>>>                   ) {
>>> -                    info!("cleanup error - {cleanup_err}");
>>> +                    log_sender
>>> +                        .log(
>>> +                            Level::INFO,
>>> +                            format!("{prefix}: cleanup error - 
>>> {cleanup_err}"),
>>> +                        )
>>> +                        .await?;
>>>                   }
>>>                   return Err(err);
>>>               }
>>> -            Ok(_) => info!("sync snapshot {} done", snapshot.dir()),
>>> +            Ok(_) => {
>>> +                log_sender
>>> +                    .log(Level::INFO, format!("{prefix}: sync done"))
>>> +                    .await?
>>> +            }
>>>           }
>>>       }
>>> @@ -622,7 +722,9 @@ async fn pull_group(
>>>       source_namespace: &BackupNamespace,
>>>       group: &BackupGroup,
>>>       shared_group_progress: Arc<SharedGroupProgress>,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>> +    let prefix = format!("{group}");
>>>       let mut already_synced_skip_info = 
>>> SkipInfo::new(SkipReason::AlreadySynced);
>>>       let mut transfer_last_skip_info = 
>>> SkipInfo::new(SkipReason::TransferLast);
>>> @@ -714,11 +816,15 @@ async fn pull_group(
>>>           .collect();
>>>       if already_synced_skip_info.count > 0 {
>>> -        info!("{already_synced_skip_info}");
>>> +        log_sender
>>> +            .log(Level::INFO, format!("{prefix}: 
>>> {already_synced_skip_info}"))
>>> +            .await?;
>>>           already_synced_skip_info.reset();
>>>       }
>>>       if transfer_last_skip_info.count > 0 {
>>> -        info!("{transfer_last_skip_info}");
>>> +        log_sender
>>> +            .log(Level::INFO, format!("{prefix}: 
>>> {transfer_last_skip_info}"))
>>> +            .await?;
>>>           transfer_last_skip_info.reset();
>>>       }
>>> @@ -730,8 +836,8 @@ async fn pull_group(
>>>           .store
>>>           .backup_group(target_ns.clone(), group.clone());
>>>       if let Some(info) = 
>>> backup_group.last_backup(true).unwrap_or(None) {
>>> -        let mut reusable_chunks = encountered_chunks.lock().unwrap();
>>>           if let Err(err) = proxmox_lang::try_block!({
>>> +            let mut reusable_chunks = 
>>> encountered_chunks.lock().unwrap();
>>>               let _snapshot_guard = info
>>>                   .backup_dir
>>>                   .lock_shared()
>>> @@ -780,7 +886,12 @@ async fn pull_group(
>>>               }
>>>               Ok::<(), Error>(())
>>>           }) {
>>> -            warn!("Failed to collect reusable chunk from last 
>>> backup: {err:#?}");
>>> +            log_sender
>>> +                .log(
>>> +                    Level::WARN,
>>> +                    format!("Failed to collect reusable chunk from 
>>> last backup: {err:#?}"),
>>> +                )
>>> +                .await?;
>>>           }
>>>       }
>>> @@ -805,13 +916,16 @@ async fn pull_group(
>>>               &to_snapshot,
>>>               encountered_chunks.clone(),
>>>               corrupt,
>>> +            Arc::clone(&log_sender),
>>>           )
>>>           .await;
>>>           // Update done groups progress by other parallel running pulls
>>>           local_progress.done_groups = 
>>> shared_group_progress.load_done();
>>>           local_progress.done_snapshots = pos as u64 + 1;
>>> -        info!("percentage done: group {group}: {local_progress}");
>>> +        log_sender
>>> +            .log(Level::INFO, format!("percentage done: 
>>> {local_progress}"))
>>> +            .await?;
>>>           let stats = result?; // stop on error
>>>           sync_stats.add(stats);
>>> @@ -829,13 +943,23 @@ async fn pull_group(
>>>                   continue;
>>>               }
>>>               if snapshot.is_protected() {
>>> -                info!(
>>> -                    "don't delete vanished snapshot {} (protected)",
>>> -                    snapshot.dir()
>>> -                );
>>> +                log_sender
>>> +                    .log(
>>> +                        Level::INFO,
>>> +                        format!(
>>> +                            "{prefix}: don't delete vanished 
>>> snapshot {} (protected)",
>>> +                            snapshot.dir(),
>>> +                        ),
>>> +                    )
>>> +                    .await?;
>>>                   continue;
>>>               }
>>> -            info!("delete vanished snapshot {}", snapshot.dir());
>>> +            log_sender
>>> +                .log(
>>> +                    Level::INFO,
>>> +                    format!("delete vanished snapshot {}", 
>>> snapshot.dir()),
>>> +                )
>>> +                .await?;
>>>               params
>>>                   .target
>>>                   .store
>>> @@ -1035,10 +1159,7 @@ pub(crate) async fn pull_store(mut params: 
>>> PullParameters) -> Result<SyncStats,
>>>               }
>>>               Err(err) => {
>>>                   errors = true;
>>> -                info!(
>>> -                    "Encountered errors while syncing namespace {} - 
>>> {err}",
>>> -                    &namespace,
>>> -                );
>>> +                info!("Encountered errors while syncing namespace 
>>> {namespace} - {err}");
>>>               }
>>>           };
>>>       }
>>> @@ -1064,6 +1185,7 @@ async fn lock_and_pull_group(
>>>       namespace: &BackupNamespace,
>>>       target_namespace: &BackupNamespace,
>>>       shared_group_progress: Arc<SharedGroupProgress>,
>>> +    log_sender: Arc<LogLineSender>,
>>>   ) -> Result<SyncStats, Error> {
>>>       let (owner, _lock_guard) =
>>>           match params
>>> @@ -1073,25 +1195,47 @@ async fn lock_and_pull_group(
>>>           {
>>>               Ok(res) => res,
>>>               Err(err) => {
>>> -                info!("sync group {group} failed - group lock 
>>> failed: {err}");
>>> -                info!("create_locked_backup_group failed");
>>> +                log_sender
>>> +                    .log(
>>> +                        Level::INFO,
>>> +                        format!("sync group {group} failed - group 
>>> lock failed: {err}"),
>>> +                    )
>>> +                    .await?;
>>> +                log_sender
>>> +                    .log(Level::INFO, "create_locked_backup_group 
>>> failed".to_string())
>>> +                    .await?;
>>>                   return Err(err);
>>>               }
>>>           };
>>>       if params.owner != owner {
>>>           // only the owner is allowed to create additional snapshots
>>> -        info!(
>>> -            "sync group {group} failed - owner check failed ({} != 
>>> {owner})",
>>> -            params.owner
>>> -        );
>>> +        log_sender
>>> +            .log(
>>> +                Level::INFO,
>>> +                format!(
>>> +                    "sync group {group} failed - owner check failed 
>>> ({} != {owner})",
>>> +                    params.owner,
>>> +                ),
>>> +            )
>>> +            .await?;
>>>           return Err(format_err!("owner check failed"));
>>>       }
>>> -    match pull_group(params, namespace, group, 
>>> shared_group_progress).await {
>>> +    match pull_group(
>>> +        params,
>>> +        namespace,
>>> +        group,
>>> +        shared_group_progress,
>>> +        Arc::clone(&log_sender),
>>> +    )
>>> +    .await
>>> +    {
>>>           Ok(stats) => Ok(stats),
>>>           Err(err) => {
>>> -            info!("sync group {group} failed - {err:#}");
>>> +            log_sender
>>> +                .log(Level::INFO, format!("sync group {group} failed 
>>> - {err:#}"))
>>> +                .await?;
>>>               Err(err)
>>>           }
>>>       }
>>> @@ -1124,7 +1268,7 @@ async fn pull_ns(
>>>       list.sort_unstable();
>>>       info!(
>>> -        "found {} groups to sync (out of {unfiltered_count} total)",
>>> +        "Found {} groups to sync (out of {unfiltered_count} total)",
>>>           list.len()
>>>       );
>>> @@ -1143,6 +1287,10 @@ async fn pull_ns(
>>>       let shared_group_progress = 
>>> Arc::new(SharedGroupProgress::with_total_groups(list.len()));
>>>       let mut group_workers = 
>>> BoundedJoinSet::new(params.worker_threads.unwrap_or(1));
>>> +    let (buffered_logger, sender_builder) = BufferedLogger::new(5, 
>>> Duration::from_secs(1));
>>> +    // runs until sender_builder and all senders built from it are 
>>> dropped
>>> +    buffered_logger.run_log_collection();
>>> +
>>>       let mut process_results = |results| {
>>>           for result in results {
>>>               match result {
>>> @@ -1160,16 +1308,20 @@ async fn pull_ns(
>>>           let target_ns = target_ns.clone();
>>>           let params = Arc::clone(&params);
>>>           let group_progress_cloned = 
>>> Arc::clone(&shared_group_progress);
>>> +        let log_sender = 
>>> Arc::new(sender_builder.sender_with_label(group.to_string()));
>>>           let results = group_workers
>>>               .spawn_task(async move {
>>> -                lock_and_pull_group(
>>> +                let result = lock_and_pull_group(
>>>                       Arc::clone(&params),
>>>                       &group,
>>>                       &namespace,
>>>                       &target_ns,
>>>                       group_progress_cloned,
>>> +                    Arc::clone(&log_sender),
>>>                   )
>>> -                .await
>>> +                .await;
>>> +                let _ = log_sender.flush().await;
>>> +                result
>>>               })
>>>               .await
>>>               .map_err(|err| format_err!("failed to join on worker task: {err:#}"))?;
>>> @@ -1197,7 +1349,7 @@ async fn pull_ns(
>>>                   if !local_group.apply_filters(&params.group_filter) {
>>>                       continue;
>>>                   }
>>> -                info!("delete vanished group '{local_group}'");
>>> +                info!("Delete vanished group '{local_group}'");
>>>                   let delete_stats_result = params
>>>                       .target
>>>                       .store
>>> @@ -1206,7 +1358,7 @@ async fn pull_ns(
>>>                   match delete_stats_result {
>>>                       Ok(stats) => {
>>>                           if !stats.all_removed() {
>>> -                            info!("kept some protected snapshots of group '{local_group}'");
>>> +                            info!("Kept some protected snapshots of group '{local_group}'");
>>>                               sync_stats.add(SyncStats::from(RemovedVanishedStats {
>>>                                   snapshots: stats.removed_snapshots(),
>>>                                   groups: 0,
>>> @@ -1229,7 +1381,7 @@ async fn pull_ns(
>>>               Ok(())
>>>           });
>>>           if let Err(err) = result {
>>> -            info!("error during cleanup: {err}");
>>> +            info!("Error during cleanup: {err}");
>>>               errors = true;
>>>           };
>>>       }
>>> diff --git a/src/server/sync.rs b/src/server/sync.rs
>>> index e88418442..17ed4839f 100644
>>> --- a/src/server/sync.rs
>>> +++ b/src/server/sync.rs
>>> @@ -13,7 +13,6 @@ use futures::{future::FutureExt, select};
>>>   use hyper::http::StatusCode;
>>>   use pbs_config::BackupLockGuard;
>>>   use serde_json::json;
>>> -use tokio::task::JoinSet;
>>>   use tracing::{info, warn};
>>>   use proxmox_human_byte::HumanByte;
>>> @@ -136,13 +135,13 @@ impl SyncSourceReader for RemoteSourceReader {
>>>                   Some(HttpError { code, message }) => match *code {
>>>                       StatusCode::NOT_FOUND => {
>>>                           info!(
>>> -                            "skipping snapshot {} - vanished since start of sync",
>>> +                            "Snapshot {}: skipped because vanished since start of sync",
>>>                               &self.dir
>>>                           );
>>>                           return Ok(None);
>>>                       }
>>>                       _ => {
>>> -                        bail!("HTTP error {code} - {message}");
>>> +                        bail!("Snapshot {}: HTTP error {code} - {message}", &self.dir);
>>>                       }
>>>                   },
>>>                   None => {
>>> @@ -176,7 +175,8 @@ impl SyncSourceReader for RemoteSourceReader {
>>>                   bail!("Atomic rename file {to_path:?} failed - {err}");
>>>               }
>>>               info!(
>>> -                "got backup log file {client_log_name}",
>>> +                "Snapshot {snapshot}: got backup log file {client_log_name}",
>>> +                snapshot = &self.dir,
>>>                   client_log_name = client_log_name.deref()
>>>               );
>>>           }
>>> -- 
>>> 2.47.3

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-21  7:42       ` Christian Ebner
@ 2026-04-21  8:00         ` Fabian Grünbichler
  2026-04-21  8:04           ` Christian Ebner
  0 siblings, 1 reply; 29+ messages in thread
From: Fabian Grünbichler @ 2026-04-21  8:00 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 21, 2026 9:42 am, Christian Ebner wrote:
> On 4/21/26 9:21 AM, Christian Ebner wrote:
>> On 4/20/26 1:54 PM, Fabian Grünbichler wrote:
>>> On April 17, 2026 11:26 am, Christian Ebner wrote:
>>>> Pulling groups and therefore also snapshots in parallel leads to
>>>> unordered log outputs, making it mostly impossible to relate a log
>>>> message to a backup snapshot/group.
>>>>
>>>> Therefore, prefix pull job log messages by the corresponding group or
>>>> snapshot and set the error context accordingly.
>>>>
>>>> Also, reword some messages, inline variables in format strings and
>>>> start log lines with capital letters to get consistent output.
>>>>
>>>> By using the buffered logger implementation and buffer up to 5 lines
>>>> with a timeout of 1 second, subsequent log lines arriving in fast
>>>> succession are kept together, reducing the mixing of lines.
>>>>
>>>> Example output for a sequential pull job:
>>>> ```
>>>> ...
>>>> [ct/100]: 2025-11-17T10:11:42Z: start sync
>>>> [ct/100]: 2025-11-17T10:11:42Z: pct.conf.blob: sync archive
>>>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: sync archive
>>>> [ct/100]: 2025-11-17T10:11:42Z: root.ppxar.didx: downloaded 16.785 
>>>> MiB (280.791 MiB/s)
>>>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: sync archive
>>>> [ct/100]: 2025-11-17T10:11:42Z: root.mpxar.didx: downloaded 65.703 
>>>> KiB (29.1 MiB/s)
>>>> [ct/100]: 2025-11-17T10:11:42Z: sync done
>>>> [ct/100]: percentage done: 9.09% (1/11 groups)
>>>> [ct/101]: 2026-03-31T12:20:16Z: start sync
>>>> [ct/101]: 2026-03-31T12:20:16Z: pct.conf.blob: sync archive
>>>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: sync archive
>>>> [ct/101]: 2026-03-31T12:20:16Z: root.pxar.didx: downloaded 199.806 
>>>> MiB (311.91 MiB/s)
>>>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: sync archive
>>>> [ct/101]: 2026-03-31T12:20:16Z: catalog.pcat1.didx: downloaded 
>>>> 180.379 KiB (22.748 MiB/s)
>>>> [ct/101]: 2026-03-31T12:20:16Z: sync done
>>>
>>> this is probably the wrong patch for this comment, but since you have
>>> the sample output here ;)
>>>
>>> should we pad the labels? the buffered logger knows all currently active
>>> labels and could adapt it? otherwise especially for host backups the
>>> logs are again not really scannable by humans, because there will be
>>> weird jumps in alignment.. or (.. continued below)
>> 
>> That will not work though? While the logger knows about the labels when 
>> receiving them, it does not know at all about future ones that might 
>> still come in. So it can happen that the padding does not fit anymore, 
>> and one gets even uglier formatting issues.
>> 
>> I would prefer to keep them as is for the time being. For vm/ct it is 
>> not that bad, one could maybe forsee the ID to be a 4 digit number and 
>> define a minimum label lenght to be padded if not reached.

ack, let's leave this as it is for now, can always be bikeshedded if
needed later on!

>> 
>> Host backups or backups with explicit label
>> 

cut off?

>> 
>>>> ...
>>>> ```
>>>>
>>>> Example output for a parallel pull job:
>>>> ```
>>>> ...
>>>> [ct/107]: 2025-07-16T09:14:01Z: start sync
>>>> [ct/107]: 2025-07-16T09:14:01Z: pct.conf.blob: sync archive
>>>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: sync archive
>>>> [vm/108]: 2025-09-19T07:37:19Z: start sync
>>>> [vm/108]: 2025-09-19T07:37:19Z: qemu-server.conf.blob: sync archive
>>>> [vm/108]: 2025-09-19T07:37:19Z: drive-scsi0.img.fidx: sync archive
>>>> [ct/107]: 2025-07-16T09:14:01Z: root.ppxar.didx: downloaded 609.233 
>>>> MiB (112.628 MiB/s)
>>>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: sync archive
>>>> [ct/107]: 2025-07-16T09:14:01Z: root.mpxar.didx: downloaded 1.172 MiB 
>>>> (17.838 MiB/s)
>>>> [ct/107]: 2025-07-16T09:14:01Z: sync done
>>>> [ct/107]: percentage done: 72.73% (8/11 groups)
>>>
>>> the way the prefix and snapshot are formatted could also be interpreted
>>> at first glance as a timestamp of the log line.. why not just prepend
>>> the prefix on the logger side, and leave it up to the caller to do the
>>> formatting? then we could us "{type}/{id}/" as prefix here? or add
>>> 'snapshot' in those lines to make it clear? granted, this is more of an
>>> issue when viewing the log via `proxmox-backup-manager`, as in the UI we
>>> have the log timestamps up front..
>> 
>> I simply followed along the line of your suggestion here. In prior 
>> versions I had exactly that but you rejected it as too verbose? :)
>> 
>> https://lore.proxmox.com/pbs-devel/1774263381.bngcrer2th.astroid@yuna.none/
>> 

my suggestions (for just the ct/107 lines above) would have been:

ct/107: 2025-07-16T09:14:01Z: start sync
ct/107: 2025-07-16T09:14:01Z pct.conf.blob: sync archive
ct/107: 2025-07-16T09:14:01Z root.ppxar.didx: sync archive
ct/107: 2025-07-16T09:14:01Z root.ppxar.didx: downloaded 609.233  MiB (112.628 MiB/s)
ct/107: 2025-07-16T09:14:01Z root.mpxar.didx: sync archive
ct/107: 2025-07-16T09:14:01Z root.mpxar.didx: downloaded 1.172 MiB  (17.838 MiB/s)
ct/107: 2025-07-16T09:14:01Z: sync done
ct/107: percentage done: 72.73% (8/11 groups)

though it might make sense to group that even more visually

ct/107: 2025-07-16T09:14:01Z: start sync
ct/107: 2025-07-16T09:14:01Z/pct.conf.blob: sync archive
ct/107: 2025-07-16T09:14:01Z/root.ppxar.didx: sync archive
ct/107: 2025-07-16T09:14:01Z/root.ppxar.didx: downloaded 609.233  MiB (112.628 MiB/s)
ct/107: 2025-07-16T09:14:01Z/root.mpxar.didx: sync archive
ct/107: 2025-07-16T09:14:01Z/root.mpxar.didx: downloaded 1.172 MiB  (17.838 MiB/s)
ct/107: 2025-07-16T09:14:01Z: sync done
ct/107: percentage done: 72.73% (8/11 groups)

or

[ct/107 2025-07-16T09:14:01Z] start sync
[ct/107 2025-07-16T09:14:01Z pct.conf.blob] sync archive
[ct/107 2025-07-16T09:14:01Z root.ppxar.didx] sync archive
[ct/107 2025-07-16T09:14:01Z root.ppxar.didx] downloaded 609.233  MiB (112.628 MiB/s)
[ct/107 2025-07-16T09:14:01Z root.mpxar.didx] sync archive
[ct/107 2025-07-16T09:14:01Z root.mpxar.didx] downloaded 1.172 MiB  (17.838 MiB/s)
[ct/107 2025-07-16T09:14:01Z] sync done
[ct/107] percentage done: 72.73% (8/11 groups)

but that is probably a matter of taste, so let's postpone this aspect
for now.

>>>
>>> and maybe(?) log the progress line using a different prefix? because
>>> right now the information that the group [ct/107] is finished is not
>>> really clear from the output, IMHO.
>> 
>> That I can add, yes.

thanks!

>> 
>>>
>>> the progress logging is also still broken (this is for a sync that takes
>>> a while, this is not log messages being buffered and re-ordered!):
>>>
>>> $ proxmox-backup-manager task log 
>>> 'UPID:yuna:00070656:001E420F:00000002:69E610EB:syncjob:local\x3atest\x3atank\x3a\x3as\x2dbc01cba6\x2d805a:root@pam:' | grep -e namespace -e 'percentage done'
>>> Syncing datastore 'test', root namespace into datastore 'tank', root 
>>> namespace
>>> Finished syncing root namespace, current progress: 0 groups, 0 snapshots
>>> Syncing datastore 'test', namespace 'test' into datastore 'tank', 
>>> namespace 'test'
>>> [host/exclusion-test]: percentage done: 5.26% (1/19 groups)
>>> [host/acltest]: percentage done: 5.26% (1/19 groups)
>>> [host/logtest]: percentage done: 5.26% (1/19 groups)
>>> [host/onemeg]: percentage done: 5.26% (1/19 groups)
>>> [host/fourmeg]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in 
>>> group #1)
>>> [host/symlink]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in 
>>> group #1)
>>> [host/symlink]: percentage done: 5.26% (1/19 groups)
>>> [host/fourmeg]: percentage done: 5.26% (1/19 groups)
>>> [host/format-v2-test]: percentage done: 1.75% (0/19 groups, 1/3 
>>> snapshots in group #1)
>>> [host/format-v2-test]: percentage done: 3.51% (0/19 groups, 2/3 
>>> snapshots in group #1)
>>> [host/format-v2-test]: percentage done: 5.26% (1/19 groups)
>>> [host/incrementaltest2]: percentage done: 2.63% (0/19 groups, 1/2 
>>> snapshots in group #1)
>>> [host/incrementaltest2]: percentage done: 5.26% (1/19 groups)
>> 
>> There will always be cases where it does not work I guess, not sure what 
>> I can do to make you happy here.
> 
> Oh, sorry, might have been too quick on this one. You are saying this 
> happens for a non-parallel sync?

no, this was a parallel sync, but the progress should be correct for
parallel syncs as well, shouldn't it?

here's the same log but less truncated:

[..]
2026-04-20T13:41:31+02:00: ----
2026-04-20T13:41:31+02:00: Syncing datastore 'test', namespace 'test' into datastore 'tank', namespace 'test'
2026-04-20T13:41:31+02:00: Found 19 groups to sync (out of 19 total)
2026-04-20T13:41:31+02:00: [ct/106]: sync group ct/106 failed - owner check failed (test@pbs != root@pam)
2026-04-20T13:41:31+02:00: [ct/999]: sync group ct/999 failed - owner check failed (test@pbs != root@pam)
2026-04-20T13:41:31+02:00: skipping snapshot host/test-another-mail/2024-03-14T08:44:25Z - in-progress backup
2026-04-20T13:41:31+02:00: [host/exclusion-test]: host/exclusion-test: skipped: 1 snapshot(s) (2024-04-23T06:53:44Z) - older than the newest snapshot present on sync target
2026-04-20T13:41:31+02:00: [host/exclusion-test]: re-sync snapshot 2024-04-23T06:54:30Z
2026-04-20T13:41:31+02:00: [host/exclusion-test]: 2024-04-23T06:54:30Z: no data changes
2026-04-20T13:41:31+02:00: [host/exclusion-test]: percentage done: 5.26% (1/19 groups)
2026-04-20T13:41:31+02:00: [host/acltest]: re-sync snapshot 2024-10-07T11:17:58Z
2026-04-20T13:41:31+02:00: [host/acltest]: 2024-10-07T11:17:58Z: no data changes
2026-04-20T13:41:31+02:00: [host/acltest]: percentage done: 5.26% (1/19 groups)
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: start sync
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: test.pxar.didx: sync archive
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: test.pxar.didx: downloaded 13.169 KiB (168.332 KiB/s)
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: catalog.pcat1.didx: sync archive
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: catalog.pcat1.didx: downloaded 2.157 KiB (10.2 KiB/s)
2026-04-20T13:41:32+02:00: [host/logtest]: 2024-01-30T10:52:34Z: rsa-encrypted.key.blob: sync archive
2026-04-20T13:41:33+02:00: [host/logtest]: 2024-01-30T10:52:34Z: sync done
2026-04-20T13:41:33+02:00: [host/logtest]: percentage done: 5.26% (1/19 groups)
2026-04-20T13:41:33+02:00: [host/onemeg]: 2023-06-28T11:13:51Z: start sync
2026-04-20T13:41:33+02:00: [host/onemeg]: 2023-06-28T11:13:51Z: test.img.fidx: sync archive
2026-04-20T13:41:33+02:00: [host/onemeg]: 2023-06-28T11:13:51Z: test.img.fidx: downloaded 2.012 KiB (5.704 KiB/s)
2026-04-20T13:41:33+02:00: [host/onemeg]: 2023-06-28T11:13:51Z: sync done
2026-04-20T13:41:33+02:00: [host/onemeg]: percentage done: 5.26% (1/19 groups)
2026-04-20T13:41:33+02:00: [host/fourmeg]: 2023-06-28T11:14:35Z: start sync
2026-04-20T13:41:33+02:00: [host/fourmeg]: 2023-06-28T11:14:35Z: test.img.fidx: sync archive
2026-04-20T13:41:33+02:00: [host/fourmeg]: 2023-06-28T11:14:35Z: test.img.fidx: downloaded 2.012 KiB (5.885 KiB/s)
2026-04-20T13:41:33+02:00: [host/fourmeg]: 2023-06-28T11:14:35Z: sync done
2026-04-20T13:41:33+02:00: [host/fourmeg]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
2026-04-20T13:41:33+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: start sync
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: start sync
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: symlink.pxar.didx: sync archive
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: symlink.pxar.didx: downloaded 145 B (802.587 B/s)
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: catalog.pcat1.didx: sync archive
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: catalog.pcat1.didx: downloaded 57 B (279.566 B/s)
2026-04-20T13:41:34+02:00: [host/symlink]: 2023-06-07T06:48:26Z: sync done
2026-04-20T13:41:37+02:00: [host/incrementaltest2]: 2023-09-25T13:43:25Z: start sync
2026-04-20T13:41:37+02:00: [host/incrementaltest2]: 2023-09-25T13:43:25Z: test.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/format-v3-test]: 2024-06-03T10:34:11Z: start sync
2026-04-20T13:41:37+02:00: [host/format-v3-test]: 2024-06-03T10:34:11Z: linux.ppxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: test.img.fidx: sync archive
2026-04-20T13:41:37+02:00: [host/linuxtesttest]: 2024-07-17T13:38:31Z: start sync
2026-04-20T13:41:37+02:00: [host/linuxtesttest]: 2024-07-17T13:38:31Z: linux.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/inctest]: 2023-11-13T13:22:57Z: start sync
2026-04-20T13:41:37+02:00: [host/inctest]: 2023-11-13T13:22:57Z: test.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/inctest2]: 2023-11-13T13:53:27Z: start sync
2026-04-20T13:41:37+02:00: [host/inctest2]: 2023-11-13T13:53:27Z: test.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/format-v2-test]: 2024-06-07T12:08:37Z: start sync
2026-04-20T13:41:37+02:00: [host/format-v2-test]: 2024-06-07T12:08:37Z: linux.ppxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/foobar]: re-sync snapshot 2024-06-10T07:48:51Z due to corruption
2026-04-20T13:41:37+02:00: [host/foobar]: 2024-06-10T07:48:51Z: foobar.ppxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/symlink]: percentage done: 2.63% (0/19 groups, 1/2 snapshots in group #1)
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: start sync
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: foo.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: foo.pxar.didx: downloaded 0 B (0 B/s)
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: catalog.pcat1.didx: sync archive
2026-04-20T13:41:37+02:00: [host/bookworm]: re-sync snapshot 2023-01-18T11:10:34Z due to corruption
2026-04-20T13:41:37+02:00: [host/bookworm]: 2023-01-18T11:10:34Z: repo.pxar.didx: sync archive
2026-04-20T13:41:37+02:00: [host/test]: 2024-07-10T08:18:51Z: start sync
2026-04-20T13:41:37+02:00: [host/test]: 2024-07-10T08:18:51Z: test.img.fidx: sync archive
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: catalog.pcat1.didx: downloaded 53 B (216.705 B/s)
2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: sync done
2026-04-20T13:41:37+02:00: [host/symlink]: percentage done: 5.26% (1/19 groups)
2026-04-20T13:42:55+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: test.img.fidx: downloaded 4 MiB (51.191 KiB/s)
2026-04-20T13:42:55+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: sync done
2026-04-20T13:42:55+02:00: [host/fourmeg]: percentage done: 5.26% (1/19 groups)

those percentage lines are just completely bogus?? and there are at least
two "breaks" of over a second where all buffered log lines must have
been flushed, so it's also not that the lines with lower percentage
were re-ordered somehow..

it goes on and on and the percentage always stays at at most 5.26% since
the finished group count never goes higher than 1.. though I have to
admit I stopped it before it finished, so maybe it would have later gone
higher? I can re-run it to completion and check what it says before it's
done?




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-21  8:00         ` Fabian Grünbichler
@ 2026-04-21  8:04           ` Christian Ebner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-21  8:04 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/21/26 9:59 AM, Fabian Grünbichler wrote:
> [..]
> 2026-04-20T13:41:37+02:00: [host/bookworm]: 2023-01-18T11:10:34Z: repo.pxar.didx: sync archive
> 2026-04-20T13:41:37+02:00: [host/test]: 2024-07-10T08:18:51Z: start sync
> 2026-04-20T13:41:37+02:00: [host/test]: 2024-07-10T08:18:51Z: test.img.fidx: sync archive
> 2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: catalog.pcat1.didx: downloaded 53 B (216.705 B/s)
> 2026-04-20T13:41:37+02:00: [host/symlink]: 2023-06-07T06:49:43Z: sync done
> 2026-04-20T13:41:37+02:00: [host/symlink]: percentage done: 5.26% (1/19 groups)
> 2026-04-20T13:42:55+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: test.img.fidx: downloaded 4 MiB (51.191 KiB/s)
> 2026-04-20T13:42:55+02:00: [host/fourmeg]: 2023-06-28T12:37:16Z: sync done
> 2026-04-20T13:42:55+02:00: [host/fourmeg]: percentage done: 5.26% (1/19 groups)
> 
> those percentage lines are just completely bogus?? and there are at least
> two "breaks" of over a second where all buffered log lines must have
> been flushed, so it's also not that the lines with lower percentage
> were re-ordered somehow..
> 
> it goes on and on and the percentage always stays at 5.26% at most, since
> the finished group count never goes higher than 1.. though I have to
> admit I stopped it before it finished, so maybe it would have gone
> higher later? I can re-run it to completion and check what it says
> before it's done?
That is because the store progress calculates the percentage based on
the groups and snapshots within the group currently being synced. But
since groups are now being pulled in parallel, each worker only tracks
the progress for its own group... Not sure if and how to fix that.




^ permalink raw reply	[flat|nested] 29+ messages in thread

* superseded: [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs
  2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
                   ` (15 preceding siblings ...)
  2026-04-20 12:33 ` [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Fabian Grünbichler
@ 2026-04-21 10:28 ` Christian Ebner
  16 siblings, 0 replies; 29+ messages in thread
From: Christian Ebner @ 2026-04-21 10:28 UTC (permalink / raw)
  To: pbs-devel

superseded-by version 7:
https://lore.proxmox.com/pbs-devel/20260421102654.610007-1-c.ebner@proxmox.com/T/





* Re: [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context
  2026-04-20 11:56   ` Fabian Grünbichler
  2026-04-21  7:21     ` Christian Ebner
@ 2026-04-21 12:57     ` Thomas Lamprecht
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Lamprecht @ 2026-04-21 12:57 UTC (permalink / raw)
  To: Fabian Grünbichler, Christian Ebner, pbs-devel

Am 20.04.26 um 13:55 schrieb Fabian Grünbichler:
> should we pad the labels? the buffered logger knows all currently active
> labels and could adapt it? otherwise especially for host backups the
> logs are again not really scannable by humans, because there will be
> weird jumps in alignment.. or (.. continued below)

IMO it might be better to keep these simple and plain and rather do any
"touching up" on the frontend UI side when rendering such a log.
For now that would need a parser for the specific task-log types to
make the renderer aware of the format, which is naturally not that
great. In the mid/long term it might be much nicer to have some
structured logging available for task logs, as then we could encode
more info that a task-log renderer could either render more nicely or
also allow filtering on.

Am 20.04.26 um 13:55 schrieb Fabian Grünbichler:
> the way the prefix and snapshot are formatted could also be interpreted
> at first glance as a timestamp of the log line.. why not just prepend
> the prefix on the logger side, and leave it up to the caller to do the
> formatting? then we could use "{type}/{id}/" as prefix here? or add
> 'snapshot' in those lines to make it clear? granted, this is more of an
> issue when viewing the log via `proxmox-backup-manager`, as in the UI we
> have the log timestamps up front..

ah, my comment above partially overlaps with this part of the reply.
And I'd also always put the time up front, IMO it's a bit odd if it's
in the "middle". But more importantly, I'd not sweat the details here
that much and would rather target staying closer to the pre-existing
format (to avoid going back-and-forth); we can still improve on that later.





end of thread, other threads:[~2026-04-21 12:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-17  9:26 [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox v6 01/15] pbs api types: add `worker-threads` to sync job config Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 02/15] tools: group and sort module imports Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 03/15] tools: implement buffered logger for concurrent log messages Christian Ebner
2026-04-20 10:57   ` Fabian Grünbichler
2026-04-20 17:15     ` Christian Ebner
2026-04-21  6:49       ` Fabian Grünbichler
2026-04-17  9:26 ` [PATCH proxmox-backup v6 04/15] tools: add bounded join set to run concurrent tasks bound by limit Christian Ebner
2026-04-20 11:15   ` Fabian Grünbichler
2026-04-17  9:26 ` [PATCH proxmox-backup v6 05/15] client: backup writer: fix upload stats size and rate for push sync Christian Ebner
2026-04-20 12:29   ` Fabian Grünbichler
2026-04-17  9:26 ` [PATCH proxmox-backup v6 06/15] api: config/sync: add optional `worker-threads` property Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 07/15] sync: pull: revert avoiding reinstantiation for encountered chunks map Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 08/15] sync: pull: factor out backup group locking and owner check Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 09/15] sync: pull: prepare pull parameters to be shared across parallel tasks Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 10/15] fix #4182: server: sync: allow pulling backup groups in parallel Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 11/15] server: pull: prefix log messages and add error context Christian Ebner
2026-04-20 11:56   ` Fabian Grünbichler
2026-04-21  7:21     ` Christian Ebner
2026-04-21  7:42       ` Christian Ebner
2026-04-21  8:00         ` Fabian Grünbichler
2026-04-21  8:04           ` Christian Ebner
2026-04-21 12:57     ` Thomas Lamprecht
2026-04-17  9:26 ` [PATCH proxmox-backup v6 12/15] sync: push: prepare push parameters to be shared across parallel tasks Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 13/15] server: sync: allow pushing groups concurrently Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 14/15] server: push: prefix log messages and add additional logging Christian Ebner
2026-04-17  9:26 ` [PATCH proxmox-backup v6 15/15] ui: expose group worker setting in sync job edit window Christian Ebner
2026-04-20 12:33 ` [PATCH proxmox{,-backup} v6 00/15] fix #4182: concurrent group pull/push support for sync jobs Fabian Grünbichler
2026-04-21 10:28 ` superseded: " Christian Ebner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH