public inbox for pbs-devel@lists.proxmox.com
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2] fix #6566: backup: api: conditionally drop group and snapshot locks
Date: Tue, 30 Sep 2025 10:44:11 +0200
Message-ID: <1759221679.xvpl4bbm4p.astroid@yuna.none>
In-Reply-To: <20250929151759.844346-1-c.ebner@proxmox.com>

On September 29, 2025 5:17 pm, Christian Ebner wrote:
> To guarantee consistency in the presence of concurrent operations,
> the backup protocol locks the backup group, the previous backup
> snapshot (if any) and the newly created backup snapshot. All of
> these locks are currently stored in the backup worker task and only
> released on its destruction.
> 
> The backup API, however, signals a successful backup via the return
> status of the `finish` call while still holding the locks.
> Therefore, an immediate subsequent backup by the same client to the
> same group can fail, because the locks cannot be acquired until the
> previous backup task is completely destroyed, which can outlive the
> `finish` return for some time. This manifests in, e.g., a push sync
> job failing.
> 
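
as a side note for readers following along: before this patch the
guards were simply moved into the worker, so they were only dropped
once the whole task was torn down. a minimal sketch with stand-in
types (not the actual PBS code):

    // stand-in for the real BackupLockGuard; dropping it releases the flock
    struct Guard;

    // old pattern: guards moved into the worker live for its entire runtime
    fn worker(group_guard: Guard, snap_guard: Guard) {
        // keep flock until task ends
        let _group_guard = group_guard;
        let _snap_guard = snap_guard;
        // ... handle the protocol, signal success via finish ...
    } // guards dropped only here, possibly well after finish returned

    fn main() {
        worker(Guard, Guard);
    }
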
> To fix this, store the lock guards in the RPC environment's shared
> state instead, allowing the locks to be selectively dropped on
> successful backup finish. On error, hold the locks until the cleanup
> has succeeded.
> 
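
the mechanism boils down to holding the guards as `Option`s in the
shared state and dropping them selectively via `take()`. a minimal
sketch with stand-in types (not the actual PBS API):

    // stand-in for the real BackupLockGuard; dropping it releases the flock
    struct Guard;

    struct SharedState {
        previous_snapshot: Option<Guard>,
        group: Option<Guard>,
        snapshot: Option<Guard>,
    }

    impl SharedState {
        fn finish(&mut self) {
            // take() moves each guard out of its Option and drops it
            // immediately, releasing the lock while the state lives on
            self.snapshot.take();
            self.group.take();
        }
    }

    fn main() {
        let mut state = SharedState {
            previous_snapshot: Some(Guard),
            group: Some(Guard),
            snapshot: Some(Guard),
        };
        state.finish(); // group and snapshot locks released here
        // on error paths finish() is never reached, so the guards are
        // only dropped together with the state after cleanup
    }
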
> Immediate verification of new snapshots previously downgraded the
> lock by dropping the exclusive lock and acquiring a shared lock.
> Since the dropping is now already handled by the `finish` call, only
> acquiring the shared lock is required. While this leaves a larger
> time window for concurrent prunes, the underlying race between
> verification and prune remains in place.
> 
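
so the verification path only needs to acquire the shared lock. a
rough sketch with a hypothetical free-standing lock helper (the real
call is `BackupDir::lock_shared`):

    // hypothetical stand-in for BackupDir::lock_shared()
    struct SharedLock;
    fn lock_shared() -> Result<SharedLock, String> {
        Ok(SharedLock)
    }

    fn verify_after_complete() -> Result<(), String> {
        // the exclusive lock was already dropped in finish(); between
        // that point and this shared lock acquisition a concurrent
        // prune may win the race - the same race as with the old
        // downgrade, just with a slightly larger window
        let _snap_lock = lock_shared()?;
        // ... run verification while holding the shared lock ...
        Ok(())
    }

    fn main() {
        verify_after_complete().unwrap();
    }
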
> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=6566
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

it should be very rare that somebody attempts to remove a snapshot
right after it was created, so the slightly increased race window there
should be okay.

did you test this with vzdump's `protected` option and with verify
after completion enabled?

> ---
> Changes since version 1:
> - rebased onto current master
> 
>  src/api2/backup/environment.rs | 38 +++++++++++++++++++++++++++++-----
>  src/api2/backup/mod.rs         | 18 +++++++++-------
>  2 files changed, 43 insertions(+), 13 deletions(-)
> 
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index ace305d7e..fa1444ab1 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -96,6 +96,27 @@ struct SharedBackupState {
>      known_chunks: KnownChunksMap,
>      backup_size: u64, // sums up size of all files
>      backup_stat: UploadStatistic,
> +    backup_lock_guards: BackupLockGuards,
> +}
> +
> +pub struct BackupLockGuards {
> +    previous_snapshot: Option<BackupLockGuard>,
> +    group: Option<BackupLockGuard>,
> +    snapshot: Option<BackupLockGuard>,
> +}
> +
> +impl BackupLockGuards {
> +    pub(crate) fn new(
> +        previous_snapshot: Option<BackupLockGuard>,
> +        group: BackupLockGuard,
> +        snapshot: BackupLockGuard,
> +    ) -> Self {
> +        Self {
> +            previous_snapshot,
> +            group: Some(group),
> +            snapshot: Some(snapshot),
> +        }
> +    }
>  }
>  
>  impl SharedBackupState {
> @@ -140,6 +161,7 @@ impl BackupEnvironment {
>          datastore: Arc<DataStore>,
>          backup_dir: BackupDir,
>          no_cache: bool,
> +        backup_lock_guards: BackupLockGuards,
>      ) -> Result<Self, Error> {
>          let state = SharedBackupState {
>              finished: BackupState::Active,
> @@ -150,6 +172,7 @@ impl BackupEnvironment {
>              known_chunks: HashMap::new(),
>              backup_size: 0,
>              backup_stat: UploadStatistic::new(),
> +            backup_lock_guards,
>          };
>  
>          let backend = datastore.backend()?;
> @@ -719,6 +742,8 @@ impl BackupEnvironment {
>                  );
>              }
>          }
> +        // drop previous snapshot lock
> +        state.backup_lock_guards.previous_snapshot.take();
>  
>          let stats = serde_json::to_value(state.backup_stat)?;
>  
> @@ -744,13 +769,17 @@ impl BackupEnvironment {
>          // marks the backup as successful
>          state.finished = BackupState::Finished;
>  
> +        // drop snapshot and group lock only here so any error above will lead to
> +        // the locks still being held in the env for the backup cleanup.
> +        state.backup_lock_guards.snapshot.take();
> +        state.backup_lock_guards.group.take();
> +
>          Ok(())
>      }
>  
>      /// If verify-new is set on the datastore, this will run a new verify task
> -    /// for the backup. If not, this will return and also drop the passed lock
> -    /// immediately.
> -    pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
> +    /// for the backup. If not, this will return.
> +    pub fn verify_after_complete(&self) -> Result<(), Error> {
>          self.ensure_finished()?;
>  
>          if !self.datastore.verify_new() {
> @@ -758,8 +787,7 @@ impl BackupEnvironment {
>              return Ok(());
>          }
>  
> -        // Downgrade to shared lock, the backup itself is finished
> -        drop(excl_snap_lock);
> +        // Get shared lock, the backup itself is finished
>          let snap_lock = self.backup_dir.lock_shared().with_context(|| {
>              format!(
>                  "while trying to verify snapshot '{:?}' after completion",
> diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
> index 8a076a2b0..246b0946d 100644
> --- a/src/api2/backup/mod.rs
> +++ b/src/api2/backup/mod.rs
> @@ -144,7 +144,7 @@ fn upgrade_to_backup_protocol(
>          };
>  
>          // lock backup group to only allow one backup per group at a time
> -        let (owner, _group_guard) = datastore.create_locked_backup_group(
> +        let (owner, group_guard) = datastore.create_locked_backup_group(
>              backup_group.backup_ns(),
>              backup_group.as_ref(),
>              &auth_id,
> @@ -183,7 +183,7 @@ fn upgrade_to_backup_protocol(
>  
>          let backup_dir = backup_group.backup_dir(backup_dir_arg.time)?;
>  
> -        let _last_guard = if let Some(last) = &last_backup {
> +        let last_guard = if let Some(last) = &last_backup {
>              if backup_dir.backup_time() <= last.backup_dir.backup_time() {
>                  bail!("backup timestamp is older than last backup.");
>              }
> @@ -210,6 +210,12 @@ fn upgrade_to_backup_protocol(
>              auth_id.to_string(),
>              true,
>              move |worker| async move {
> +                // Keep flock for the backup runtime by storing guards in backup env shared state.
> +                // Drop them on successful backup finish or when dropping the env after cleanup in
> +                // case of errors. The former is required for immediate subsequent backups (e.g.
> +                // during a push sync) to be able to lock the group and snapshots.
> +                let backup_lock_guards = BackupLockGuards::new(last_guard, group_guard, snap_guard);
> +
>                  let mut env = BackupEnvironment::new(
>                      env_type,
>                      auth_id,
> @@ -217,6 +223,7 @@ fn upgrade_to_backup_protocol(
>                      datastore,
>                      backup_dir,
>                      no_cache,
> +                    backup_lock_guards,
>                  )?;
>  
>                  env.debug = debug;
> @@ -271,11 +278,6 @@ fn upgrade_to_backup_protocol(
>                      });
>                  let mut abort_future = abort_future.map(|_| Err(format_err!("task aborted")));
>  
> -                // keep flock until task ends
> -                let _group_guard = _group_guard;
> -                let snap_guard = snap_guard;
> -                let _last_guard = _last_guard;
> -
>                  let res = select! {
>                      req = req_fut => req,
>                      abrt = abort_future => abrt,
> @@ -293,7 +295,7 @@ fn upgrade_to_backup_protocol(
>                  }
>  
>                  let verify = |env: BackupEnvironment| {
> -                    if let Err(err) = env.verify_after_complete(snap_guard) {
> +                    if let Err(err) = env.verify_after_complete() {
>                          env.log(format!(
>                              "backup finished, but starting the requested verify task failed: {}",
>                              err
> -- 
> 2.47.3


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

