From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [PATCH proxmox-backup] fix #6566: backup: api: conditionally drop group and snapshot locks
Date: Tue, 23 Sep 2025 15:32:22 +0200
Message-ID: <20250923133222.582944-1-c.ebner@proxmox.com>

To guarantee consistency under possibly concurrent operations, the
backup protocol locks the backup group, the previous backup
snapshot (if any) and the newly created backup snapshot. All of
these locks are currently stored in the backup worker task and only
released on its destruction.

The backup API, however, signals a successful backup via the return
status of the `finish` call while still holding the locks.
Therefore, an immediate subsequent backup by the client to the same
group can fail, because the locks cannot be acquired until the
previous backup task is completely destroyed, which can outlive the
`finish` return for some time. This manifests in e.g. a push sync
job failing.

To fix this, store the lock guards inside the RPC environment's
shared state instead, allowing the locks to be dropped selectively
on successful backup finish. On error, hold the locks until the
cleanup has completed.

Immediate verification of new snapshots already downgraded the lock
by dropping the exclusive lock and acquiring a shared lock. Since
the dropping is now handled by the finish call, only acquiring the
shared lock is required. While there is now a larger time window
for concurrent prunes, the underlying possible race between
verification and prune remains in place.

Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=6566
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Note: intended to also be backported to stable-3 if accepted for
current master.

 src/api2/backup/environment.rs | 38 +++++++++++++++++++++++++++++-----
 src/api2/backup/mod.rs         | 18 +++++++++-------
 2 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index d5e6869cd..489da0720 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -87,6 +87,27 @@ struct SharedBackupState {
     known_chunks: KnownChunksMap,
     backup_size: u64, // sums up size of all files
     backup_stat: UploadStatistic,
+    backup_lock_guards: BackupLockGuards,
+}
+
+pub struct BackupLockGuards {
+    previous_snapshot: Option<BackupLockGuard>,
+    group: Option<BackupLockGuard>,
+    snapshot: Option<BackupLockGuard>,
+}
+
+impl BackupLockGuards {
+    pub(crate) fn new(
+        previous_snapshot: Option<BackupLockGuard>,
+        group: BackupLockGuard,
+        snapshot: BackupLockGuard,
+    ) -> Self {
+        Self {
+            previous_snapshot,
+            group: Some(group),
+            snapshot: Some(snapshot),
+        }
+    }
 }
 
 impl SharedBackupState {
@@ -130,6 +151,7 @@ impl BackupEnvironment {
         datastore: Arc<DataStore>,
         backup_dir: BackupDir,
         no_cache: bool,
+        backup_lock_guards: BackupLockGuards,
     ) -> Result<Self, Error> {
         let state = SharedBackupState {
             finished: false,
@@ -140,6 +162,7 @@ impl BackupEnvironment {
             known_chunks: HashMap::new(),
             backup_size: 0,
             backup_stat: UploadStatistic::new(),
+            backup_lock_guards,
         };
 
         let backend = datastore.backend()?;
@@ -658,6 +681,8 @@ impl BackupEnvironment {
                 );
             }
         }
+        // drop previous snapshot lock
+        state.backup_lock_guards.previous_snapshot.take();
 
         // check for valid manifest and store stats
         let stats = serde_json::to_value(state.backup_stat)?;
@@ -672,13 +697,17 @@ impl BackupEnvironment {
         // marks the backup as successful
         state.finished = true;
 
+        // drop snapshot and group lock only here so any error above will lead to
+        // the locks still being held in the env for the backup cleanup.
+        state.backup_lock_guards.snapshot.take();
+        state.backup_lock_guards.group.take();
+
         Ok(())
     }
 
     /// If verify-new is set on the datastore, this will run a new verify task
-    /// for the backup. If not, this will return and also drop the passed lock
-    /// immediately.
-    pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
+    /// for the backup. If not, this will return.
+    pub fn verify_after_complete(&self) -> Result<(), Error> {
         self.ensure_finished()?;
 
         if !self.datastore.verify_new() {
@@ -686,8 +715,7 @@ impl BackupEnvironment {
             return Ok(());
         }
 
-        // Downgrade to shared lock, the backup itself is finished
-        drop(excl_snap_lock);
+        // Get shared lock, the backup itself is finished
         let snap_lock = self.backup_dir.lock_shared().with_context(|| {
             format!(
                 "while trying to verify snapshot '{:?}' after completion",
diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
index ae61ff697..eb28dce1e 100644
--- a/src/api2/backup/mod.rs
+++ b/src/api2/backup/mod.rs
@@ -144,7 +144,7 @@ fn upgrade_to_backup_protocol(
         };
 
         // lock backup group to only allow one backup per group at a time
-        let (owner, _group_guard) = datastore.create_locked_backup_group(
+        let (owner, group_guard) = datastore.create_locked_backup_group(
             backup_group.backup_ns(),
             backup_group.as_ref(),
             &auth_id,
@@ -183,7 +183,7 @@ fn upgrade_to_backup_protocol(
 
         let backup_dir = backup_group.backup_dir(backup_dir_arg.time)?;
 
-        let _last_guard = if let Some(last) = &last_backup {
+        let last_guard = if let Some(last) = &last_backup {
             if backup_dir.backup_time() <= last.backup_dir.backup_time() {
                 bail!("backup timestamp is older than last backup.");
             }
@@ -210,6 +210,12 @@ fn upgrade_to_backup_protocol(
             auth_id.to_string(),
             true,
             move |worker| async move {
+                // Keep flock for the backup runtime by storing guards in backup env shared state.
+                // Drop them on successful backup finish or when dropping the env after cleanup in
+                // case of errors. The former is required for immediate subsequent backups (e.g.
+                // during a push sync) to be able to lock the group and snapshots.
+                let backup_lock_guards = BackupLockGuards::new(last_guard, group_guard, snap_guard);
+
                 let mut env = BackupEnvironment::new(
                     env_type,
                     auth_id,
@@ -217,6 +223,7 @@ fn upgrade_to_backup_protocol(
                     datastore,
                     backup_dir,
                     no_cache,
+                    backup_lock_guards,
                 )?;
 
                 env.debug = debug;
@@ -271,11 +278,6 @@ fn upgrade_to_backup_protocol(
                     });
                 let mut abort_future = abort_future.map(|_| Err(format_err!("task aborted")));
 
-                // keep flock until task ends
-                let _group_guard = _group_guard;
-                let snap_guard = snap_guard;
-                let _last_guard = _last_guard;
-
                 let res = select! {
                     req = req_fut => req,
                     abrt = abort_future => abrt,
@@ -287,7 +289,7 @@ fn upgrade_to_backup_protocol(
                 }
 
                 let verify = |env: BackupEnvironment| {
-                    if let Err(err) = env.verify_after_complete(snap_guard) {
+                    if let Err(err) = env.verify_after_complete() {
                         env.log(format!(
                             "backup finished, but starting the requested verify task failed: {}",
                             err
-- 
2.47.3


