From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion
<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2] fix #6566: backup: api: conditionally drop group and snapshot locks
Date: Tue, 30 Sep 2025 10:44:11 +0200
Message-ID: <1759221679.xvpl4bbm4p.astroid@yuna.none>
In-Reply-To: <20250929151759.844346-1-c.ebner@proxmox.com>
On September 29, 2025 5:17 pm, Christian Ebner wrote:
> To guarantee consistency in the presence of concurrent operations,
> the backup protocol locks the backup group, the previous backup
> snapshot (if any) and the newly created backup snapshot. All of
> these locks are currently stored in the backup worker task and only
> released on its destruction.
>
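
a quick note for readers following along: the guards are plain RAII
values, so moving them into the worker closure ties the flock release to
the destruction of the task itself. a minimal toy sketch of that
mechanism - invented types, not the real pbs-datastore API:

    struct Guard(&'static str);

    impl Drop for Guard {
        fn drop(&mut self) {
            println!("released {} lock", self.0);
        }
    }

    fn main() {
        let group = Guard("group");
        let prev = Guard("previous snapshot");
        let snap = Guard("snapshot");

        // `move` pulls all three guards into the closure value, just like
        // the guards moved into the backup worker task
        let worker = move || {
            let _ = (&group, &prev, &snap);
            println!("finish returned, client already got its response");
        };
        worker();
        println!("locks are still held here ...");
        drop(worker); // prints the three "released" lines
        println!("... and only now can the next backup lock the group");
    }
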
> The backup API however signals a successful backup via the return
> status of the `finish` call, while still holding the locks.
> Therefore, an immediate subsequent backup by the client to the same
> group can fail, because the locks cannot be acquired until the
> previous backup task is completely destroyed, which can outlive the
> `finish` return for some time. This manifests in e.g. a push sync
> job failing.
>
> To fix this, store the lock guards inside the RPC environment's
> shared state instead, allowing the locks to be dropped selectively
> on successful backup finish. On error, hold the locks until cleanup
> has completed successfully.
>
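
the Option + take() dance is subtle enough that a sketch may help when
skimming the diff below - hand-wavy, with invented names rather than the
actual environment type:

    struct Guard; // stand-in for a BackupLockGuard

    struct SharedState {
        group: Option<Guard>,
        snapshot: Option<Guard>,
    }

    impl SharedState {
        fn finish(&mut self, success: bool) {
            if success {
                // take() swaps in None and immediately drops the returned
                // guard, releasing the lock before the response goes out
                self.snapshot.take();
                self.group.take();
            }
            // on failure the fields stay Some(..), so the locks are only
            // released once the environment itself is dropped after cleanup
        }
    }

    fn main() {
        let mut state = SharedState {
            group: Some(Guard),
            snapshot: Some(Guard),
        };
        state.finish(true);
        assert!(state.group.is_none() && state.snapshot.is_none());
    }
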
> Immediate verification of new snapshots already downgraded the lock
> by dropping the exclusive lock and acquiring a shared one. Since
> dropping the exclusive lock is now handled by the `finish` call,
> only acquiring the shared lock is required. While this leaves a
> larger time window for concurrent prunes, the underlying possible
> race between verification and prune remains in place.
>
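
to make the mentioned window concrete, here is a standalone sketch in
terms of plain flock(2), using the fs2 crate's wrappers instead of the
pbs-datastore lock helpers the real code goes through (lock file path
made up):

    use std::fs::OpenOptions;
    use fs2::FileExt; // fs2 = "0.4"

    fn main() -> std::io::Result<()> {
        let file = OpenOptions::new()
            .create(true)
            .write(true)
            .open("/tmp/snapshot.lck")?; // hypothetical lock file

        file.lock_exclusive()?; // held while the backup is written
        // finish() succeeds and now drops the exclusive lock right away
        file.unlock()?;

        // <- window: a concurrent prune can grab the lock and remove the
        //    snapshot here, before verification gets to it

        file.lock_shared()?; // verify task acquires a shared lock
        // ... read and verify the snapshot contents ...
        file.unlock()?;
        Ok(())
    }
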
> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=6566
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
it should be very rare that removal of a snapshot is attempted right
after it was created, so the slightly increased race window there
should be okay.
did you test this with vzdump's `protected` option and verify after
completion?
> ---
> Changes since version 1:
> - rebased onto current master
>
>  src/api2/backup/environment.rs | 38 +++++++++++++++++++++++++++++-----
>  src/api2/backup/mod.rs         | 18 +++++++++-------
>  2 files changed, 43 insertions(+), 13 deletions(-)
>
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index ace305d7e..fa1444ab1 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -96,6 +96,27 @@ struct SharedBackupState {
>      known_chunks: KnownChunksMap,
>      backup_size: u64, // sums up size of all files
>      backup_stat: UploadStatistic,
> +    backup_lock_guards: BackupLockGuards,
> +}
> +
> +pub struct BackupLockGuards {
> +    previous_snapshot: Option<BackupLockGuard>,
> +    group: Option<BackupLockGuard>,
> +    snapshot: Option<BackupLockGuard>,
> +}
> +
> +impl BackupLockGuards {
> +    pub(crate) fn new(
> +        previous_snapshot: Option<BackupLockGuard>,
> +        group: BackupLockGuard,
> +        snapshot: BackupLockGuard,
> +    ) -> Self {
> +        Self {
> +            previous_snapshot,
> +            group: Some(group),
> +            snapshot: Some(snapshot),
> +        }
> +    }
>  }
>
>  impl SharedBackupState {
> @@ -140,6 +161,7 @@ impl BackupEnvironment {
>          datastore: Arc<DataStore>,
>          backup_dir: BackupDir,
>          no_cache: bool,
> +        backup_lock_guards: BackupLockGuards,
>      ) -> Result<Self, Error> {
>          let state = SharedBackupState {
>              finished: BackupState::Active,
> @@ -150,6 +172,7 @@ impl BackupEnvironment {
>              known_chunks: HashMap::new(),
>              backup_size: 0,
>              backup_stat: UploadStatistic::new(),
> +            backup_lock_guards,
>          };
>
>          let backend = datastore.backend()?;
> @@ -719,6 +742,8 @@ impl BackupEnvironment {
>                  );
>              }
>          }
> +        // drop previous snapshot lock
> +        state.backup_lock_guards.previous_snapshot.take();
>
>          let stats = serde_json::to_value(state.backup_stat)?;
>
> @@ -744,13 +769,17 @@ impl BackupEnvironment {
>          // marks the backup as successful
>          state.finished = BackupState::Finished;
>
> +        // drop snapshot and group lock only here so any error above will lead to
> +        // the locks still being held in the env for the backup cleanup.
> +        state.backup_lock_guards.snapshot.take();
> +        state.backup_lock_guards.group.take();
> +
>          Ok(())
>      }
>
>      /// If verify-new is set on the datastore, this will run a new verify task
> -    /// for the backup. If not, this will return and also drop the passed lock
> -    /// immediately.
> -    pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
> +    /// for the backup. If not, this will return.
> +    pub fn verify_after_complete(&self) -> Result<(), Error> {
>          self.ensure_finished()?;
>
>          if !self.datastore.verify_new() {
> @@ -758,8 +787,7 @@ impl BackupEnvironment {
>              return Ok(());
>          }
>
> -        // Downgrade to shared lock, the backup itself is finished
> -        drop(excl_snap_lock);
> +        // Get shared lock, the backup itself is finished
>          let snap_lock = self.backup_dir.lock_shared().with_context(|| {
>              format!(
>                  "while trying to verify snapshot '{:?}' after completion",
> diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
> index 8a076a2b0..246b0946d 100644
> --- a/src/api2/backup/mod.rs
> +++ b/src/api2/backup/mod.rs
> @@ -144,7 +144,7 @@ fn upgrade_to_backup_protocol(
>      };
>
>      // lock backup group to only allow one backup per group at a time
> -    let (owner, _group_guard) = datastore.create_locked_backup_group(
> +    let (owner, group_guard) = datastore.create_locked_backup_group(
>          backup_group.backup_ns(),
>          backup_group.as_ref(),
>          &auth_id,
> @@ -183,7 +183,7 @@ fn upgrade_to_backup_protocol(
>
>      let backup_dir = backup_group.backup_dir(backup_dir_arg.time)?;
>
> -    let _last_guard = if let Some(last) = &last_backup {
> +    let last_guard = if let Some(last) = &last_backup {
>          if backup_dir.backup_time() <= last.backup_dir.backup_time() {
>              bail!("backup timestamp is older than last backup.");
>          }
> @@ -210,6 +210,12 @@ fn upgrade_to_backup_protocol(
>          auth_id.to_string(),
>          true,
>          move |worker| async move {
> +            // Keep flock for the backup runtime by storing guards in backup env shared state.
> +            // Drop them on successful backup finish or when dropping the env after cleanup in
> +            // case of errors. The former is required for immediate subsequent backups (e.g.
> +            // during a push sync) to be able to lock the group and snapshots.
> +            let backup_lock_guards = BackupLockGuards::new(last_guard, group_guard, snap_guard);
> +
>              let mut env = BackupEnvironment::new(
>                  env_type,
>                  auth_id,
> @@ -217,6 +223,7 @@ fn upgrade_to_backup_protocol(
>                  datastore,
>                  backup_dir,
>                  no_cache,
> +                backup_lock_guards,
>              )?;
>
>              env.debug = debug;
> @@ -271,11 +278,6 @@ fn upgrade_to_backup_protocol(
>              });
>              let mut abort_future = abort_future.map(|_| Err(format_err!("task aborted")));
>
> -            // keep flock until task ends
> -            let _group_guard = _group_guard;
> -            let snap_guard = snap_guard;
> -            let _last_guard = _last_guard;
> -
>              let res = select! {
>                  req = req_fut => req,
>                  abrt = abort_future => abrt,
> @@ -293,7 +295,7 @@ fn upgrade_to_backup_protocol(
>              }
>
>              let verify = |env: BackupEnvironment| {
> -                if let Err(err) = env.verify_after_complete(snap_guard) {
> +                if let Err(err) = env.verify_after_complete() {
>                      env.log(format!(
>                          "backup finished, but starting the requested verify task failed: {}",
>                          err
> --
> 2.47.3
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
Thread overview:
2025-09-29 15:17 Christian Ebner
2025-09-30 8:44 ` Fabian Grünbichler [this message]
2025-09-30 9:21 ` Christian Ebner
2025-10-01 8:59 ` [pbs-devel] applied: " Fabian Grünbichler